LinkedDataOps: Quality Oriented End-to-end Geospatial Linked Data Production Pipelines

Tracking #: 3045-4259

Beyza Yaman
Kevin Thompson
Fergus Fahey
Rob Brennan

Responsible editor: 
Guest Editors SW for Industrial Engineering 2022

Submission type: 
Application Report
This work describes the application of semantic web standards to data quality governance in the architectural, engineering, and construction (AEC) domain for Ordnance Survey Ireland (OSi). It illustrates an approach based on establishing a unified knowledge graph for data quality measurements across a complex, quality-centric data production pipeline. It provides a series of new mappings between semantic models of the heterogeneous data quality standards applied by different tools and business units. The overall scope of this work is to improve the quality and service outcomes of an organization while conforming to the standards and supporting good decision-making through enabling an end-to-end data governance approach. Current industrial practice tends towards stove-piped, vendor-specific and domain-dependent tools to process data quality observations; however, there is a lack of open techniques and methodologies for combining quality measurements derived from different data quality standards to provide end-to-end data quality reporting, root cause analysis or visualization. This work demonstrated that it is effective to use a knowledge graph and semantic web standards to unify distributed data quality monitoring in an organization and present the results in an end-to-end data dashboard in a data quality standards-agnostic fashion for the Ordnance Survey Ireland data publishing pipeline. This paper provides the first comprehensive mapping of standardized generic information systems data quality dimensions and geospatial data quality dimensions into a unified semantic model of data quality.

Major Revision

Solicited Reviews:
Review #1
By Umutcan Simsek submitted on 18/Apr/2022
Minor Revision
Review Comment:

The paper reports on an approach and an application for building a knowledge graph for quality assessment results. The knowledge graph mainly focuses on the harmonization of different quality assessment standards and enables stakeholders involved with geospatial data to have a better overview of the data quality from the perspective of different metadata standards.
I find the paper overall very good. The motivation and the potential impact are supported with reasonable arguments and evaluated to provide sufficient evidence. The resources are published on GitHub and appear to be complete. The only significant criticism I would have of this paper is that it goes beyond an application report, which causes the arguments about the actual impact of the application to remain rather superficial.
The abstract and introduction give the impression that the paper is mostly about harmonizing different metadata standards for quality and providing data lineage to observe the change in data quality throughout a pipeline to help decision making. It only becomes clear much later that the approach also involves the introduction of new metrics, which is not motivated as well as the need for the harmonization of the metadata standards. In the evaluation, even though I understand the qualitative discussion regarding the newly introduced metrics, the evaluation of existing datasets in 5.2 does not really serve the purpose of this paper. The discussion about how the application is perceived by the stakeholders, and whether the application answered the question in the introduction and the requirements stated in Section 2, is somewhat hidden in the lessons learned section. To summarize, I have the following revision suggestions:
- better motivate the introduction of new metrics, if these are a core piece of the approach
- give more details about how the feedback was obtained from the OSi members, and discuss the application in the scope of the collected requirements.
- also on a more of a side note: the subsumption relationship between Complete and Completeness dimensions from two different standards can be explained a bit better. From the definitions, it is not necessarily clear that Complete is narrower. (I guess 1:1 match does not require all attributes to appear on each instance?)
Other than that I have some even more minor comments regarding the clarity and presentation:
- The OSi acronym is introduced arbitrarily, it is already used in the introduction, then introduced in Section 2 twice.
- Section 3 is more written in a literature review / related work fashion. Maybe renaming it makes sense.
- page 1 line 47: has -> have, line 48: role *in* the way
- page 5 line 39: Amrapali et al -> Zaveri et al
- footnote 14 seems to be left there to be converted into a citation one day :)
- 4.1 is not really a step in the approach but more of an overview. the actual description of the technical architecture starts from 4.2, which may be a bit confusing.
- page 9 line 28: Listing 4 -> Listing 3
- page 10 line 29: level level
- page 11 line 28: narrow -> narrower
- I am not sure if Figure 6 is cross-referenced anywhere in the paper

Review #2
Anonymous submitted on 20/May/2022
Minor Revision
Review Comment:

# General comments

This paper presents an application report where Semantic technologies are used to create a quality assessment framework for geospatial data, taking into account multiple (geospatial) quality standards. It provides semantic definitions and interlinking of concepts for multiple quality standards, which enables quality assessment from multiple perspectives. The work is motivated by the geospatial data publishing pipeline use case, that takes place at the Ordnance Survey Ireland (OSi) organization, and builds upon it to enrich data transformation processes with richer and more granular quality assessments at different data life-cycle stages.

Resources (ontologies, mappings and dashboard source code) are made publicly available which constitute a valuable contribution for the research community. Produced KGs were not accessible at review time though.

Although the paper does not provide a section labeled as Related Work, it is clear that Section 3 (Background) provides references not only to technologies and concepts this work relies upon, but also to other related work, highlighting their shortcomings and how the authors' work makes a contribution.

However, my biggest surprise was that there was not a single mention of SHACL (or alternatives like ShEx) as a W3C standard for performing Linked Data validation and quality assessments. I can understand if implementation decisions to opt for OWL definitions were made when SHACL was still not a mainstream technology, but not that it is completely ignored in the paper, especially when the topic is directly related to it. I suggest that the authors at least consider discussing their perspective on how SHACL/ShEx could contribute to, change, or relate to their current work.

Concretely, I would suggest a minor revision for the paper. Below are more detailed comments regarding the review dimensions for application report papers, followed by comments specific to particular sections.

## Quality, importance and impact

The paper provides a very detailed description of the application, which is also well motivated, with a compelling use case. The work also provides a clear demonstration of the usefulness of Semantic technologies as enablers of data integration. Therefore, I consider the work to be of high quality and importance, especially taking into account the efforts made for interpreting and interlinking multiple international data quality standards and the technical implementation efforts for Knowledge Graph generation and the creation of user interfaces (dashboard) powered by semantically annotated data.

In terms of impact, this work seems to have contributed to changing/improving the internal data workflows of OSi. However, there is no concrete evidence that the solution introduced by this paper is still actively used, especially considering that the knowledge graphs produced by this work are offline (basically everything under domain).

## Clarity and readability

Significant improvement in the quality of writing is required. In particular, the Introduction and the Evaluation sections require a thorough check. I found multiple typos (listed below) but there are probably more. An automated grammar check could already help detect many of them.

# Detailed comments

## Introduction

The Introduction needs a thorough revision of its writing (mostly at the beginning). Multiple grammatical errors are present (some pointed out below) and the ideas are poorly connected.

> initiatives have focused decentralized data lakes among datasets.

This sentence does not make sense. How does one focus decentralized data lakes among datasets? I think this needs to be rephrased.

## Section 3

> OGC equivalence of the documents can be seen on the right hand side of the table.

What table are you referring to?

## Section 4

In section 4.2 you refer to Listings 1-4 saying that they show how error ORA13349 is mapped, but the Listings actually describe error 13356.

> Materialized data is presented on Listing 3....The quality measurement is computed using R2RML function demonstrated in Listing 4.

The Listing references are inverted. Listing 3 shows R2RML and Listing 4 materialized data.

What is called Table 2 is actually a Listing.

## Section 5

> ...subset were also assessed to provide a common baseline for comparison between dataset

This sentence is not very clear. What does it mean to provide a common baseline with a subset? Please clarify what exactly is meant here.

> By applying a suite of over 500 quality rules to the PRIME2 topographic dataset it is possible to assure very high levels of compliance...the use of probabilistic (sampling-based) metrics as deployed in Luzzu for computationally expensive metrics is an advantage.

There is no mention of probabilistic metrics in the section where metrics are defined and described. This whole paragraph should be part of Section 4 and more details should be given about this.

> Average aggregated results shows around 50% completeness for the overall data at most for these datasets

This needs rephrasing for clarity.

> ...especially, when conformance is a straightforward change in the prefixes in the data.

Please elaborate on this: what does a change in the prefixes in the data mean? Is this referring to changes in @prefix declarations of Turtle files to change vocabularies?

> Given that datasets are not updated for a long period, indeed, LOD cloud is about to become a museum for the datasets [12].

This sentence seems to be out of place here.

## Section 7

> It was seen even though there is a lot of standards representation in Linked Data that it is still in the transforming phase and they are mostly disconnected

This needs rephrasing for clarity.

## Typos

Page 1, Line 44: " has been..." → " have been..."

Page 1, Line 47: "...technologies has played..." → "...technologies have played..."

Page 2, Line 8: " which provides..." → " which provide..."

Page 4, Line 37: "...more significant more than..." → "...more significant than..."

Page 4, Line 41: "...advises managing..." → "...advises for managing..."

Page 7, Line 21: "naively" → "natively"

Page 9, Line 20-21: "As a result the the relational..." → "As a result the relational..."

Page 10, Line 29: "...instance level level..." → "...instance level..."

Page 13, Line 33: "In order to to..." → "In order to..."

Page 16, Line 21: " in our..." → " our..."

Page 17, Line 45: "...Linked Data standards..." → "...Linked Data, standards..."

Review #3
By David Chaves-Fraga submitted on 25/May/2022
Review Comment:

This paper describes a framework to integrate, analyze and exploit quality metrics of geospatial data using a semantic web approach. The authors propose the use of several semantic web technologies and standards to integrate the large number of non-semantic data models in the geospatial domain. The main use case where this framework has been implemented is the Ordnance Survey Ireland (OSi) mapping agency. The topic fits very well with the special issue. I use the recommendations of the journal for these kinds of papers (application report) to review the paper.

(1) Quality, importance, and impact of the described application
The paper tackles a very important problem in the geospatial domain, which is the large number of different standards and ways of representing the quality of the data. The idea of using semantic web standards in the project to integrate all of them together, as well as with sustainable data pipelines, is very interesting. In addition, the authors mention that OSi employees see the application as very useful for their work. I really liked the work of mapping the different standards (Table 3), which I think is the most relevant contribution of the paper. My main concern here is the impact that the tool can have beyond the OSi agency, i.e. it is neither demonstrated nor mentioned whether this framework can be integrated into other agencies or with other datasets. There is no clear motivation for the use of dedicated software (1Spatial 1Integrate and Luzzu), which may impact negatively on the adoption of the framework in other contexts.
I also did not understand the decision to use R2RML-F for materializing instead of a more declarative and sustainable approach using RML+FnO[1], with the description of the transformation functions in a declarative way and the possibility of using other tools such as FunMap[2], Ontop[3] or Morph-KGC[4], which have already been demonstrated to be more efficient[4,5]. I really missed a related work section in the paper to compare the proposed approach with others from different domains where similar problems are tackled, and I still do not understand why there is no reference to the GeoTriples work[6,7] and Ontop-spatial[8].
Finally, I visited the resources provided by the authors using Gogs, but none of the resources follows good open science practices (no license, no description, no DOI, etc.), so reusability is not ensured.

(2) Clarity and readability of the describing paper
The paper is very difficult to follow: there are many long sentences, with repeated concepts within them, where the ideas are not clear. I think the authors wanted to demonstrate all the effort they put into the development of the project, but the paper does not clearly show it. The general structure is very difficult to follow, with many confused ideas and concepts. For example, it is not clear which is the main contribution of the paper because the authors restate it several times throughout the introduction with different aims, questions, or vague ideas (governance of data, data pipelines, automatic approaches). The lack of clarity in transmitting the ideas is evident when there are 8 different contributions.

The readability of the paper is not very high, as there are sentences that do not make sense or concepts that are not introduced properly. For example, p1-c50-second column: “There is a common feature of all these systems that these applications need unification of high quality geospatial data, computer methods and domain knowledge to provide high quality results”. What does "high quality results" mean? Results about what? Queries? Analysis? I could extract similar examples from many other parts of the paper (see p17-c34-second column, where the sentence runs for almost 8 lines without commas or full stops). In addition, there are many typos and inconsistencies within the paper (mentioning only a few of them): Listing 3 is referenced as materialized data when it is actually a transformation function; Listing 4 contains RDF errors (no datatypes, no quotes for the plur:value property, etc.); Table 2 is actually a Listing; and it is unclear why it is relevant to mention the features of the computer where the experiments were run. I would encourage the authors to dedicate time to the details of the paper, as the work presented is very relevant and can have a high impact on the community, but the clarity of the ideas and the way the paper is written need to change.

Overall, I think the paper is not ready for acceptance in this journal and it would need a complete rewrite, beyond what is expected of a major revision.

[1] Meester, B. D., Maroy, W., Dimou, A., Verborgh, R., & Mannens, E. (2017, May). Declarative data transformations for Linked Data generation: the case of DBpedia. In European Semantic Web Conference (pp. 33-48). Springer, Cham.
[2] Jozashoori, S., Chaves-Fraga, D., Iglesias, E., Vidal, M. E., & Corcho, O. (2020, November). Funmap: Efficient execution of functional mappings for knowledge graph creation. In International Semantic Web Conference (pp. 276-293). Springer, Cham.
[3] Calvanese, D., Cogrel, B., Komla-Ebri, S., Kontchakov, R., Lanti, D., Rezk, M., ... & Xiao, G. (2017). Ontop: Answering SPARQL queries over relational databases. Semantic Web, 8(3), 471-487.
[4] Arenas-Guerrero, J., Chaves-Fraga, D., Toledo, J., Pérez, M. S., & Corcho, O. (2022). Morph-kgc: Scalable knowledge graph materialization with mapping partitions. Semantic Web.
[5] Arenas-Guerrero, J., Scrocca, M., Iglesias-Molina, A., Toledo, J., Gilo, L. P., Dona, D., ... & Chaves-Fraga, D. (2021). Knowledge graph construction with R2RML and RML: an ETL system-based overview.
[6] Mandilaras, G., & Koubarakis, M. (2021, October). Scalable Transformation of Big Geospatial Data into Linked Data. In International Semantic Web Conference (pp. 480-495). Springer, Cham.
[7] Kyzirakos, K., Savva, D., Vlachopoulos, I., Vasileiou, A., Karalis, N., Koubarakis, M., & Manegold, S. (2018). GeoTriples: Transforming geospatial data into RDF graphs using R2RML and RML mappings. Journal of Web Semantics, 52, 16-32.
[8] Bereta, K., Xiao, G., & Koubarakis, M. (2019). Ontop-spatial: Ontop of geospatial databases. Journal of Web Semantics, 58, 100514.