Towards an Ontological Approach for Integrating Declarative Mapping Languages

Tracking #: 2913-4127

Authors: 
Ana Iglesias-Molina
Andrea Cimmino
Edna Ruckhaus
David Chaves-Fraga
Raúl García-Castro
Oscar Corcho

Responsible editor: 
Tania Tudorache

Submission type: 
Ontology Description
Abstract: 
Nowadays, Knowledge Graphs are extensively created using very different techniques: tools such as OpenRefine, programming scripts, or mapping languages, among others. Focusing on the latter, the wide variety of use cases, data peculiarities, and potential uses has had a substantial impact on how these languages have been created, extended, and applied. This situation is closely related to the global adoption of these languages and their associated tools. The large number of languages and compliant tools, together with the frequent lack of information on how to combine both, leads users to resort to other techniques to construct Knowledge Graphs. Often, users choose to create their own ad hoc programming scripts that suit their needs. This choice is normally less reproducible and maintainable, which ultimately affects the quality of the generated RDF data, particularly in long-term scenarios. In this paper, we present the Conceptual Mapping, built as an ontology and designed to represent the expressiveness of existing mapping languages to construct Knowledge Graphs. With this ontology, we aim to enhance the interoperability of existing mapping languages, which may be achieved in different ways: by ensuring the representation of other mappings and by allowing translation among them. This language is built as a result of a thorough analysis of the features and capabilities of current mapping languages, which is presented as a comparative framework.
Tags: 
Reviewed

Decision/Status: 
Major Revision

Solicited Reviews:
Review #1
By Ben De Meester submitted on 10/Dec/2021
Suggestion:
Major Revision
Review Comment:

There are some minor grammar errors (in general, I would suggest reviewing the use of prepositions, as they are not always correct) and unfortunate phrasings,
but overall, the paper is clearly written. However, I think the reproducibility of the method and the validation of the result are not fully discussed,
and I'm having a hard time trying to review the following recommended indicators:
- the design principles (throughout my comments below, you can see that I'm a bit confused about what the scope of the Conceptual Mapping is supposed to be, and thus question some design decisions)
- comparison with other ontologies on the same topic (other ontologies are extensively discussed, but a clear alignment between the Conceptual Mapping and these other ontologies is missing)
- and pointers to existing applications or use-case experiments (this seems to be future work)

More specifically:
- The comparison framework: it is currently unclear what this entails; however, I have the feeling you were doing "stamp collecting" rather than "physics" (I don't want to sound negative, both have merits): analyzing existing languages and extracting their features, rather than trying to come up with a complete set and map it to the languages.
Please be very clear about what you are doing, clarify, and provide argumentation to make your work more reproducible (e.g., how do you detect certain features: based on the specification, a set of test cases, verified expert opinion, ...?).
- The mapping language does not seem validated: please provide argumentation as to why certain modeling decisions were made, and provide some proof that the mapping language is validated. As this is an ontology paper, I currently have trouble finding the proof for "Quality and relevance of the described ontology (convincing evidence must be provided)".
- Some of the arguments of the discussion section come out of the blue for me, e.g. the provenance, applicable shapes, extensible character: I fail to understand how these have been extracted from existing mapping language features, and if not extracted, why they are added.

That said, the work is valuable, relevant, and timely. I believe the hardest work has been done, and many of the comments I have are requests for clarifications.
However, without a validation of the actual result (i.e., a comparison/alignment with the other languages, to showcase you indeed cover all bases),
I would not be inclined to accept it, hence I suggest a major revision.

Introduction

- I really like the conclusion of p2 left column line 26
- Please clarify whether you made an ontology to unify definitions across mapping languages, or a mapping language that is a superset of existing mapping languages. This is currently not clear. Given this was submitted as an Ontology Description I assume a "short paper describing ontology modeling and creation efforts", however, I have the feeling this is not the case here. If this instead is supposed to be a full paper, I would assume to review "originality, significance of the results, and quality of writing [...] and more specifically the evaluation sections in a style and level of detail that enables the replication of their results".
- I commend the authors for clarifying the usage rights, persistent identifier, and clear documentation of the ontology. However, I found the following errors:
- I'm quite surprised a mixture of rdfs:comment and skos:definition is used for the ontology, that doesn't feel right.
- http://vocab.linkeddata.es/def/conceptual-mapping/protocols_list.ttl returns a 404

Related work

- For 2.1, I would appreciate some discussion on which criteria you used to select this set of mapping languages to compare with, and not, e.g., SPARQL-Anything, XRM, or SMS2.
- In 2.3, for me, it is not clear why in Mapeathor the spreadsheet is language-independent. Mapeathor imposes a specific structure within the spreadsheet (is that not 'a language'?), very similar to how, e.g., YARRRML imposes a specific structure within a YAML document. Neither changes the underlying serialization (and for completeness, YARRRML also supports translation into R2RML). That said, this kind of discussion is quite interesting: what kind of distinction is there between, e.g., Mapeathor and YARRRML vs RML and SPARQL-Generate? What changes if someone builds, e.g., a tool that directly works on YARRRML mappings instead of translated RML? In general, I'm not sure this distinction is relevant for this paper.
- For 2.3, I miss a concluding remark; for now, I don't understand why this section is included (translation is future work, AFAICT)

Comparison framework

- You state which languages you include, but there's no argumentation as to why those (a similar comment to my first one on the related work section). Can you provide a more rigorous argumentation as to why you include specifically those?
- The example is a quite limited RDF structure -> no rdf:Lists or similar constructs, no graphs. I'm currently not convinced this is a complete example that touches your complete ontology/language.
- It is for me not clear whether this language/ontology is meant to be abstract (so tries to attempt completeness), or rather a superset of existing languages (so tries to cover everything that existing languages cover). There is a distinction between these two, so clarifying that scope is important.
- For example: Data retrieval: the fact that there are 3 retrieval 'modes': is this complete, or is this extracted based on the features of existing mapping languages? [1] describes factors that influence an RDF graph generation algorithm, and makes the distinction between 'real-time' and 'on-demand' trigger, where I can see that your distinction 'Streams' maps to 'real-time', and 'Asynchronous' and 'Synchronous' are two types of 'on-demand' triggers (event-based could be a third type of on-demand trigger). I'm not saying one categorization is better than the other, but some argumentation would be good. If the point above is well tackled, this point should become a non-issue.
- For the data source description, I would expect a discussion on the extensibility of the language vs support of tools implementing the language, e.g. RML does support Streaming data sources (https://github.com/RMLio/RMLStreamer#processing-a-stream), but this is indeed not specified in the original paper or specification, since RML provides an extension point concerning data source descriptions, and is only implemented in the RMLStreamer. How could you compare such features on a language level?
- p8 right line 43: please clarify provenance, it is not clear. Given that you state "No language considers the specification of its provenance", I have the feeling you attempt to create a complete set of features. I would strongly suggest not going down this path and instead trying to create a superset that (only) contains the features that are currently supported by your set of mapping languages. Otherwise, I would need some argumentation (and preferably, theoretical grounding) why some features are taken into account, and why some aren't.

Mapping Language Ontology

- For reproducibility, I would expect that the requirements specification is (publicly) available
- p11 left line 9: so the set of functions a mapping may use is predefined, not extensible. How is Conceptual Mapping then an abstraction of existing features? (e.g. FunUL supports function extensions)
- I think the mapping validation is important, however, I could not find the results described or linked to in the paper. Based on what I read up till now, I would expect some validation document that states 'feature X is supported by construction Y (or a combination of constructions)', and as such, you can prove that you cover all features. Or, if impossible, argue why this is not provided. Otherwise, I cannot review "whether the provided resources appear to be complete for replication of experiments, and if not, why".
- For example, it is unclear to me which feature the CombinedFrame construct solves.
- How do you specify that an expression is either an XPath or a JSONpath, or 'among others'? Why is this not a SKOS ConceptScheme?
- By linking the datatype to the statement, don't you get in trouble when you want to create mixed-type rdf:Lists?
- From the features listed in Section 3, I don't understand why the Ontology or shapes constructs were required.
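To make the SKOS question above concrete, the set of expression languages could be modeled along the following lines. This is purely an illustrative sketch by the reviewer, not part of the Conceptual Mapping ontology; the `ex:` namespace and concept names are hypothetical:

```turtle
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix ex:   <http://example.org/expression-languages#> .

# A concept scheme enumerating the source expression languages
ex:ExpressionLanguages a skos:ConceptScheme ;
    skos:prefLabel "Source expression languages"@en .

ex:XPath a skos:Concept ;
    skos:prefLabel "XPath"@en ;
    skos:inScheme ex:ExpressionLanguages .

ex:JSONPath a skos:Concept ;
    skos:prefLabel "JSONPath"@en ;
    skos:inScheme ex:ExpressionLanguages .
```

A scheme like this would let new expression languages ('among others') be added as concepts without changing the ontology itself.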

Discussion

- p15 left line 35: the problem of 'lack of information on valid combinations' has not been clearly explained up till now.
- p15 left line 45: the purpose of this language has, up till now, not been validated: there's no overview of how existing mapping language features are supported by conceptual mapping language.
- p15 right line 3: please exemplify and better argue why provenance definition and shape applicability are relevant in this paper. For now, it is unclear. Same with extensibility: currently it is a bit unclear whether this is possible (I assume you mean you can add metadata triples to the existing mapping graph? Is that an extracted feature?)
- I kind of disagree that mapping governance has not been developed so far, e.g. http://events.linkeddata.org/ldow2016/papers/LDOW2016_paper_04.pdf (last page) showcases how author metadata could be added, using PROV-O statements.
- I find the maintenance guarantee a bit underpromising (who is "we" in this case?), can this be made stronger, e.g., some public statement on Github, an endorsement by an organization instead of a group of people?
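The PROV-O approach from the LDOW 2016 paper mentioned above can be illustrated with a minimal sketch. The identifiers below are hypothetical, chosen only to make the governance point concrete:

```turtle
@prefix prov: <http://www.w3.org/ns/prov#> .
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .
@prefix ex:   <http://example.org/mappings#> .

# Author and timestamp metadata attached to a mapping, using PROV-O
ex:personMapping a prov:Entity ;
    prov:wasAttributedTo ex:alice ;
    prov:generatedAtTime "2021-12-10T09:00:00Z"^^xsd:dateTime .

ex:alice a prov:Agent, prov:Person .
```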

Conclusion and Future Work

- "an ontology-based conceptual model that aims to gather the expressiveness of current mapping languages" --> so why also include other features (such as provenance) that are mentioned in the comparison framework to not be considered by any language?
- "Finally, we want to specify the correspondence of concepts between the considered mapping languages and the Conceptual Mapping" --> I'm very much confused by this statement, then what is the Conceptual Mapping at this point? How can you be sure that it currently gathers the expressiveness of current mapping languages if you don't have this correspondence table?

-- spelling/grammar/details

I personally prefer using the Oxford comma consistently, for both 'and' and 'or'.
I tagged some phrases that were unclear in the PDF at link https://www.dropbox.com/s/4hr7lobx1fm006v/swj2913_bdm.pdf?dl=0

Review #2
By Herminio Garcia-Gonzalez submitted on 15/Dec/2021
Suggestion:
Major Revision
Review Comment:

The present paper introduces a new ontology that aims to serve as an agnostic representation of the heterogeneous data mapping languages used to generate Knowledge Graphs. For this purpose, the authors carried out a deep study of the existing data mapping languages to then produce a kind of meta-language which promises to be able to represent the functionalities of these languages in a language-agnostic way. The objective of this paper is a very valid and strong one, as over the last years more and more data mapping languages have appeared but, unfortunately, without further convergence, leaving users alone to explore this overwhelming field. However, I found the arguments and motivation used in the paper not strong enough.

It is claimed that the ontology is able to represent all of the languages' expressivity; however, I would argue that this is a very difficult goal to achieve, as the variability and difference in functionality and expressivity among the languages can be huge. For example, looking at SPARQL-Generate, which offers great functionality and flexibility, it is possible to indicate an iterator (e.g., from 1 to n) to harvest all the data in an API. Looking at the ontology, I don't see a way to represent this without losing the semantics. Additionally, as mentioned in Section 3, xR2RML offers the possibility to push down a value further in a JSON tree structure in order to cover the inability of JSONPath to get parent nodes. Again, I do not see how this could be represented following the proposed ontology. Therefore, I think that the real scope should be clarified, alongside the problem of losing expressivity and how this would enable backwards conversion without losing the original semantics.

Another interesting point is the argument that this ontology would serve as an interchange language but also as a meta-language that would prevent users from being tied to a specific language or engine. It is true that, if widely supported, this ontology would indeed cover the translation among languages, but it would also need to cover very specific cases and be very flexible, which in the end would translate into adding more characteristics to the original ontology. This would therefore end up in an even more verbose language which, being a meta-language, is already more verbose than existing ones. So, I do not see how usability would be fostered if users need to spend more time producing the mapping rules. I am also very interested in the rationale behind creating a new language (although its main objective is to be a meta-language), instead of trying to improve an existing language by incorporating features from other languages.

Then, in Section 3, I really liked the great effort surveying the existing languages and I think that the tables in Appendix A are of great value. But I miss two things here: firstly, I would add Facade-X (aka SPARQL-Anything) to the comparison, as it is a very interesting and new approach. Secondly, I think that in some parts the text only remarks on what is and is not possible in each language, while a brief discussion of the pros and cons of these decisions would really benefit the text and also a newcomer to the field. For example, in Section 3.3, Linking Rules, you describe how different languages can make links. However, it is intuitive that if a language is able to generate dynamic subjects, then the link will be automatically built if the resulting subject and object URIs are the same. So, it is interesting to further discuss the benefits of using other constructions.
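To make the linking remark concrete, here is a minimal sketch (with hypothetical URIs) of the outputs of two independent mapping rules that become linked simply because one rule's object URI template produces the same URI as the other rule's subject template:

```turtle
@prefix ex: <http://example.org/ns#> .

# Rule 1 output: a trip resource pointing at its stop
<http://example.org/trip/42> ex:hasStop <http://example.org/stop/7> .

# Rule 2 output: the stop itself, generated by a separate rule
<http://example.org/stop/7> a ex:Stop ;
    ex:name "A Coruña" .
```

No explicit linking construct is needed here; the shared URI `<http://example.org/stop/7>` creates the link in the merged graph.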

Some other remarks:

ShExML is included as an RDF-based language, which it is not. ShExML is based on Shape Expressions but, unlike ShEx, it does not have an RDF syntax. Also, in Listing 3, the ShExML mapping rules would not work, as the "$." included in the FIELDs is not required.

In the introduction you mention "Deciding which language and technique should be used in each scenario [...] that reduce reproducibility, maintainability and reusability." I guess that the real problem is not having the right tool to do the mappings, and that the time-benefit analysis of using and learning these tools comes out worse than going for ad hoc solutions. I think that finding one tool that suits all needs is quite utopian, and therefore different tools should be used (if available). Then, thanks to the RDF compositional property, the results can be merged seamlessly, which is a big benefit when compared to data integration in other formats.

In Section 2.2 you cite [21] as an initial analysis of 5 mapping languages. I think the correct one is:
De Meester, B., Heyvaert, P., Verborgh, R., & Dimou, A. (2019). Mapping languages analysis of comparative characteristics. In 1st International Workshop on Knowledge Graph Building, co-located with the 16th Extended Semantic Web Conference (ESWC 2019) (Vol. 2489).

In Section 5 you mention benchmarks as one possible use case, in order to avoid having to produce mapping rules for each language being tested. However, in a benchmark you would need to use specific features in order to really assess the performance of a language/engine; if you rely on translation, then you are introducing a bias, namely the translation itself, which can be better or worse depending on the target language.

Also in Section 5 you say: "This situation is directly related to the adoption of these technologies". This is a strong claim; please support it with some references or drop it.

In Section 6 you conclude by saying "where mapping rules are first-class citizens". Although this is a well-known term in programming languages theory it is not clear what you mean here.

You should carefully review the format of the references. Some mistakes: [4] has no venue/journal, [10] has no authors, [13] "iot" should be capitalised.

Just out of curiosity, how would the ontology support hierarchical data, namely accessing nested levels in the hierarchy? Would it be using join conditions or has it some direct support?

A very picky one: you mention in Section 2 that YARRRML is a serialization format for RML; in that sense, I would expect to be able to convert back and forth from and to RML, but that is not the case in its conception. Therefore, I would define YARRRML more as a compact syntax for RML rather than only a serialization format.

Typos:

Abstract:

capabilities of current mapping languages -> capabilities of the current mapping languages

Introduction:

This analysis studies how languages describe the access to data sources, how triples are created and their distinctive features: and is presented as a comparative framework. -> This analysis, presented as a comparative framework, studies how languages describe the access to data sources, how triples are created and their distinctive features.

Related work

existing mapping languages, summarized in Table 1 -> existing mapping languages, listed in Table 1

Comparison framework

the main features included in mapping languages -> the main features included in the mapping languages

General features for Graph Construction:
Generate feature that apply yo statements -> Generate feature that apply to statements

Review #3
By Jose Emilio Labra-Gayo submitted on 10/Feb/2022
Suggestion:
Major Revision
Review Comment:

The paper describes an ontology about mapping languages that captures the commonalities between different mapping languages and technologies like R2RML, XSPARQL, CSVW, ShExML, etc.

Given that the paper has been submitted as an ontology description, I will use the criteria defined by the journal to assess these papers.

(1) Quality and relevance of the described ontology.

The ontology has been defined to give support to the problem of categorizing different mapping technologies that have appeared in the last years. The exercise of defining this ontology can shed light on this topic and is relevant as it can improve existing approaches.

The ontology has been developed following a well-established methodology and the ontology itself follows best practices.

One possible drawback is that the ontology seems to have been defined by members of the same research group, without proper consensus work involving external stakeholders, which could help improve the acceptance of the ontology.

In the discussion, the authors say that "we guarantee that the ontology will evolve incorporating new features of the mapping languages, ensuring transparency and the correct translation among them. For example, at the moment of writing, the RML mapping language is evolving towards an updated specification throughout the W3C Community Group on Knowledge Graph Construction." Although that is indeed a very good ambition, there is no clear evidence for it. It would be more natural for the ontology to already handle the different extensions and proposals that are being discussed in the Community Group. In that sense, it would be great if it were possible to add some provenance information linking the concepts included in the ontology to the discussions about those concepts in the Community Group or in other venues, or at least some mechanism to track the relationship between them.

(2) Illustration, clarity and readability of the describing paper.

Overall, the paper is very readable and the ontology has been defined following a sound methodological approach. The ontology has been published using tools like Widoco and OnToology and has a dereferenceable URI at: http://vocab.linkeddata.es/def/conceptual-mapping#

The ontology is available in a GitHub repository and contains a README, although the README contains only a picture of the ontology. I would recommend that the authors improve the quality of the README, adding more information about how to contribute to the ontology, what methodology has been used, some examples, etc.

There is not enough information about which ontology patterns have been used in the design of the ontology.

The authors indicate that they have been using competency questions and statements that have been validated by the stakeholders; however, I couldn't find information about which competency questions were used or which stakeholders participated in the process.

They also say that the ontology was evaluated with OOPS! and that some minor issues were found and fixed. It would be nice if those issues, or new issues, appeared in the GitHub repository and, in general, if the ontology development process followed practices commonly employed for open source projects, using GitHub issues. At the moment of this writing, I could only find one issue… I encourage the authors to keep maintaining the ontology and to create more issues documenting that process.

The authors declare that a real use case is the development of the GTFS-Madrid-Bench, where different mapping languages have been used. However, it is not clear whether the authors have also used ontology instances that define those mappings, and whether there could be some translation between those ontology instances and the real mappings. In my opinion, the GitHub repo would improve if it contained some examples of those mappings defined as ontology instances and converted to the real mappings.

(4) whether the provided data artifacts are complete.

In my opinion, this ontology is a first version of an ontology proposal which has not yet been reviewed or discussed by the community and the stakeholders. Although the idea may be interesting, it would be important to define some tools that could translate instances of the ontology into real mapping scripts, which could help assess the completeness of the concepts represented in the ontology.

The approach followed by the authors, based on the creation of an ontology instead of a mapping language, is interesting, but it is not clear what the advantages of this approach are over other approaches in practice. I mean, although having a common ontology to define the different mapping concepts is beneficial to understand the domain better, it would also help if there were some tools that could convert existing mapping languages to instances of that ontology and, vice versa, from instances of that ontology to some existing mappings.

Some minor typos/comments:
Figure 3.b contains "La Coruña" while Table 3.c contains "A Coruña". I am not sure whether the authors provide some tools to unify both names, which would also be interesting, or whether it is just a typo.
On page 11, the authors say that a source frame corresponds to a data source and defines which data in the source is retrieved and how it is fragmented, with expressions using XPath, JSONPath, etc. How are those different expression languages distinguished? Would the ontology need another property to declare the language employed in those expressions?
Page 13, the mapping use the data sources…
Page 14: Listing 9 uses the prefix cmf in, for example, cmf:concat; however, I think the prefix cmf is not declared in the ontology.
In the discussion section, the authors say that the Conceptual Mapping can ease the knowledge graph generation process. However, it is not clear why instances of the Conceptual Mapping ontology, like, for example, the running example in Listings 9 and 10, would be more usable than the ShExML example in Listing 3. Maybe the authors could add some discussion about how to improve the usability of those mapping languages.
The authors also declare that the GTFS-Madrid-Bench uses mappings in R2RML, RML, CSVW and xR2RML that were created manually, and that having tools that allow translation between those languages would facilitate those tasks… However, the current proposal seems to be only an ontology. Are the authors considering creating such translators between the ontology instances and those mappings? If this is still work in progress, I am not sure whether the authors consider an ontology the better approach, or whether they have considered defining an intermediate language instead of the ontology, which could be easier to manage than ontology instances in RDF. Maybe some discussion about alternatives to solve this issue could improve the paper.
Reference 2 contains "J. E. L. Gayo", when it should better be "J. E. Labra-Gayo", and Reference 10 lacks the authors.