Not Everybody Speaks RDF: From Knowledge Graph Construction to Knowledge Conversion between Different Data Representations

Tracking #: 3753-4967

Authors: 
Mario Scrocca
Alessio Carenini
Marco Grassi
Marco Comerio
Irene Celino

Responsible editor: 
Guest Editors KG Construction 2024

Submission type: 
Full Paper
Abstract: 
Knowledge representation in RDF guarantees shared semantics and enables interoperability in data exchanges. Various approaches have been proposed for RDF knowledge graph construction, with declarative mapping languages emerging as the most reliable and reproducible solutions. However, not all information systems can understand and process data encoded as RDF. In these scenarios, to guarantee seamless communication there is a need for a further conversion of RDF graphs to one or more target data formats and models. Existing solutions for the declarative lifting of data to RDF are not able to effectively support knowledge conversion towards a generic output. Based on an examination of existing mapping languages and processors for RDF knowledge graph construction, we define a reference workflow supporting a knowledge conversion process between different data representations. The proposed workflow is validated by the mapping-template tool, an open-source implementation based on a popular template engine. The template-based mapping language enables the definition of mappings without requiring prior knowledge of RDF and provides flexibility for the target output. The tool is evaluated qualitatively, considering common challenges in the declarative specification of mappings, and quantitatively, considering performance and scalability. This paper extends a previous version of this work by integrating a discussion of the proposed workflow considering the analysed state-of-the-art for knowledge graph construction, introducing the tool's direct support for the execution of RML mapping rules, and describing a more comprehensive qualitative and quantitative evaluation, also considering the results obtained by participating in the Knowledge Graph Construction Challenge 2024.
Tags: 
Reviewed

Decision/Status: 
Major Revision

Solicited Reviews:
Review #1
By Enrique Iglesias submitted on 31/Oct/2024
Suggestion:
Major Revision
Review Comment:

General Comments
Summary
This paper introduces a workflow that facilitates knowledge conversion into different data representations. The workflow uses a template-based mapping language that allows users unfamiliar with RDF to define mapping rules. This paper presents the mapping-template tool, an open-source implementation capable of processing the template-based mapping language. Additionally, an extensive experimental study is used to compare the performance of the mapping-template tool to the current state-of-the-art. The authors seek to extend a work previously presented at the Knowledge Graph Construction Workshop at ESWC 2024, “Not Everybody Speaks RDF: Knowledge Conversion between Different Data Representations”, by including a larger empirical evaluation study, a discussion of a proposed workflow considering the current state-of-the-art, and the introduction of the tool’s support for the execution of RML mapping rules.

This paper is well-written and goes into extensive detail about the functionality of the mapping-template tool. However, one major issue counts against it: this paper overlaps a great deal with the authors' previous work, “Not Everybody Speaks RDF: Knowledge Conversion between Different Data Representations.” The titles are nearly identical, sharing both the tag line “Not Everybody Speaks RDF” and the phrase “Knowledge Conversion between Different Data Representations.” All the sections and sub-sections are named the same (except for one sub-section). Figures 1, 3, and 4 and their corresponding captions are the same as in the previous paper. The qualitative study is mostly the same, with just some quality-of-life improvements, like introducing a table that helps describe the test cases. The conclusions are the same, only with different future work. To be fair, in the quantitative study, there is a whole new set of experiments based on the Knowledge Graph Construction Challenge 2024, where the performance of the mapping-template tool is compared to the other participants of the challenge (i.e., FlexRML, RPT/Sansa, and RMLStreamer). Unfortunately, the first set of experiments in the quantitative study section is the same as in the previous paper. Additionally, the authors introduced a new sub-section that discusses how the workflow presented in section 3 supports existing mapping languages and tools.

In conclusion, while the paper introduces some new content, including a new set of experiments and a discussion on workflow applications, it still closely mirrors the previous work in key areas. The authors should consider extending or replacing the empirical study (specifically the qualitative study) with new experiments. The paper needs extensive reworking/rewriting to reduce the significant overlap with the previous paper.

Section Comments
Section 1: Introduction
Positive: This section introduces the work. It highlights the shortcomings of current state-of-the-art tools in terms of prior knowledge and difficulties in using them. Additionally, it gives a brief overview of the proposed approach and its benefits and describes the paper's structure.
Negative:
The introduction in this section is very similar to that in the previous work; it needs to be rewritten.
This section needs to highlight how this work differs from the previous one, and it has to do so in much more detail than the abstract does.

Section 2: Preliminaries and related work
Positive: This section presents all the necessary background knowledge to understand this paper and highlights some essential works in the area of knowledge transformation, such as SDM-RDFizer, Morph-KGC (which is used in the experimental study), FunMap, etc. Additionally, the authors have included newer works like FlexRML.
Negative:
It is understandable that this paper and its predecessor share the same background knowledge, but unfortunately the way the content is written is also very similar. It would be beneficial for the authors to introduce additional concepts such as “mapping lifting,” which is mentioned in the introduction and given a small description but nothing more than that.
Some works are mentioned by the reference number and not by authors. For example, “The paper from [39]” and “RML is proposed in [23].”
The related work is very similar to what was presented in the authors’ previous work. The authors did add additional works to the related work, mainly the engines they used for the new experiments of the quantitative section. Unfortunately, the content of the related work is very lackluster. The authors mentioned many different works in the area of knowledge graph creation and data transformation, but they do not detail the pros and cons of these tools. For example, Morph-KGC uses concurrent processing to generate a knowledge graph, but what are this approach's positive and negative points?
There has to be a comparison between what is being proposed in this work and what is being presented in the related work. How does the behavior of the mapping template differ from that of other engines? How does it benefit/affect its performance in comparison to other engines?
The authors mention all the available extensions, but they need to explain their purpose in detail or provide the necessary background knowledge to fully understand their function. For example, the RML-Star extension is described as allowing users to create RDF-Star triples and nothing else. How does this extension do this in comparison to the normal RML? What is an RDF-Star triple? These things are mentioned but not explained.
The knowledge graph creation and data transformation tools would benefit from having their own sub-section in the related work section.
In sub-section 2.3, there is “In previous work”. It should be “In a previous work”

Section 3: A workflow for declarative knowledge conversion
Positive: This section details the workflow proposed in this work, illustrating and describing all three phases, as well as providing a figure that helps visualize what is being described. In addition, the authors discussed how the presented workflow could help existing approaches.
Negative:
This section needs an introduction that describes what to expect in this section.
Since the authors originally described the workflow in their previous work, the three sub-sections in this work that describe it are very similar to what is described in the last work. These sub-sections need to be rewritten to reduce the overlap.
Subsection 3.4 needs to be named something other than “Related work discussion.” This would suggest that it should be in the related work section, not this one.
Table 1 has a couple of things that either need to be better explained or that I feel are missing:
Why do some mapping language specifications not have a defined data source specification? For example, RML+FnO uses CSV files and Relational databases, and RML-Star uses CSV files. Why is that not on the table?
Why do some mapping language specifications have an “X” for their data source specification? For example, RML In-memory uses in-memory data structures like dictionaries and hash tables. Why is that not on the table?
Why is the Declarative Mapping Rules column mostly empty when all RML extensions use RDF?

Section 4: The mapping-template tool
Positive: This section describes the mapping-template tool and describes all its functionalities. Additionally, it highlights how this newer version of the mapping-template tool can execute mappings with the new RML specification.
Negative: This section is mostly similar to the previous paper's description of the mapping-template tool, with the only exception being the description of how the proposed tool incorporated the execution of mappings with the new RML specification. The overlapping portion of this section (the first five paragraphs) needs to be rewritten.

Section 5: Qualitative Evaluation
Positive: This section describes existing test cases and examples that allow potential users to learn how to execute the mapping template tool. It provides a figure that illustrates some of these examples. The introduction of a table allows for a better understanding of the test cases.
Negative: This is the same study as the one conducted in the authors’ previous work. A new qualitative study with another set of test cases is recommended so that this work can be differentiated from the previous work.

Section 6: Quantitative Evaluation
Positive: This section presents the experimental configuration and results achieved. The mapping-template tool outperforms Morph-KGC in terms of execution time and memory usage. Additionally, the authors added a new set of experiments based on the test cases given in the Knowledge Graph Construction Challenge 2024 performance track.
Negative:
The authors simplified the join operation for the SHAPES data source (which is the heaviest operation in the entire GTFS-Madrid-Bench dataset). If the join had not been simplified, it could have helped present different results from the ones shown in the authors’ previous work because the results presented for this set of experiments are currently the same as the ones in the last work.

Section 7: Adoption cases of the mapping-template tool
Positive: This section presents real-world scenarios where the proposed tool is used.
Negative: This section is very similar to what is said in the previous work. It needs to be revised.

Section 8: Conclusions and Future Work
Positive: This section works well as the paper's closing, illustrating the conclusions reached and the authors' future direction.
Negative:
The conclusions are mostly the same as what was said in the authors’ last work. It would be beneficial to discuss in the conclusions how FlexRML outperformed the proposed solution in various test cases and, given the understanding of how FlexRML functions, to incorporate some of its approaches to improve the mapping-template tool’s performance as possible future work. The portion about future work is just a suggestion.

Review #2
By Davide Lanti submitted on 08/Dec/2024
Suggestion:
Major Revision
Review Comment:

This manuscript was submitted as 'full paper' and should be reviewed along the usual dimensions for research contributions which include (1) originality, (2) significance of the results, and (3) quality of writing. Please also assess the data file provided by the authors under “Long-term stable URL for resources”. In particular, assess (A) whether the data file is well organized and in particular contains a README file which makes it easy for you to assess the data, (B) whether the provided resources appear to be complete for replication of experiments, and if not, why, (C) whether the chosen repository, if it is not GitHub, Figshare or Zenodo, is appropriate for long-term repository discoverability, and (4) whether the provided data artifacts are complete. Please refer to the reviewer instructions and the FAQ for further information.

*** Contribution ***

The authors present mapping-template, a tool based on the Apache Velocity template engine for specifying transformations between different data formats. To frame their tool at an abstract level, they introduce a workflow for the generalization of the mapping process and perform a qualitative analysis of existing mapping approaches against this workflow. A comparison, both qualitative and quantitative, of mapping-template against state-of-the-art R2RML/RML engines concludes the paper.

The proposed abstract workflow and the extensions to the tool are valuable contributions to the field, and the qualitative analysis is thorough. However, in my opinion certain claims and comparisons require further clarification and validation, as detailed below.

*** Novelty ***

This aspect of the work raises significant concerns. The submission substantially overlaps with [A], a 19-page paper from the Knowledge Graph Construction Workshop that shares the same title as the present work. Many sections are word-for-word identical, which raises questions about the level of new contribution.

The authors highlight some notable extensions in the abstract:

- Analysis of the state-of-the-art for Knowledge Graph Construction (KGC) within the context of the proposed workflow (Table 1 and Section 3.4).

- Direct support for the execution of RML mapping rules (end of Section 4) in a tool called "mapping-template-rml".

- Inclusion of results from the KGC Challenge 2024 and extensions of tests for mapping-template-rml.

While these extensions are useful, the overlap with [A] should be explicitly addressed, and [A] should be properly cited (I was surprised not to find any citation of this work).

Regarding the claim made in the abstract about addressing a gap in literature---support for conversion between different data formats---this requires deeper discussion. The statement, "Existing solutions for the declarative lifting of data to RDF are not able to effectively support knowledge conversion towards a generic output," is only partially justified. Specifically, in Section 2.3, the authors acknowledge solutions like [54] and [19], which already address lowering mappings to heterogeneous data sources. Clarifying the precise gap filled by this work is essential to strengthening the novelty claim.

Finally, the use of the term "knowledge conversion" is ambiguous. Clearer terminology or a concrete definition would improve comprehension.

*** Significance of Results ***

The qualitative analysis considering the requirements for declarative mapping languages is detailed and provides useful insights. However, I have a few concerns on the other half of the qualitative analysis, as well as the quantitative analysis:

1) Relevance of the quantitative results for "mapping-template":

- The analysis is undermined by the fact that mapping-template does not support RML, and that to fill this gap the authors manually converted the RML mappings into mappings for their tool.

- The manual conversion raises questions about fairness. Were these mappings validated for equivalence? Could manual optimization during conversion have provided an unfair advantage? Explicit validation and examples of these mappings would clarify this issue.

- Furthermore, the comparison between mapping-template and mapping-template-rml (which natively supports RML) is not thoroughly discussed. For example, Figure 6 suggests differences in performance, but the reasons for these differences are not adequately analyzed.

2) Applicability to Big Data Scenarios:

- Results indicate that mapping-template-rml struggles with memory consumption and large datasets, raising concerns about its scalability. Addressing this limitation would enhance the tool's practical relevance.

3) Omitted Experiments:

- The omission of comparisons with morph-kgc for mapping-template-rml and the lack of tests at larger scales for Figure 6 reduce the comprehensiveness of the analysis. Including these would strengthen the quantitative evaluation.

*** Quality of Writing ***

The writing is generally clear but could be improved with better organization and formatting. I provide a few examples:

- Figures and Tables:

-- Figure 1 lacks a detailed description clarifying the meaning of the layers, rectangles, rounded rectangles, and so on.

- Section Structure:

-- I would add a sentence introducing Sections 3.1 to 3.3.
-- I would reformat Section 5 to separate the discussion of requirements and test cases. The structure of Section 6 could be improved in a similar way.

- Minor Issues:

-- Labels like C1, C4, etc., in Section 5 are used without explanation and should be introduced (they were in the workshop publication).

-- "The declarative specification of heterogeneous data sources (Data source specification) is supported to enable several use cases through different extensions" -> What is the subject of this sentence?

-- the template "http://example.com/{class}" on the right-hand side of Fig. 3 should be "http://example.com/{type}" instead.

-- Typo in Footnote 27.

Finally, I think that a couple of statements in the paper should be slightly adjusted. Specifically:

1) "For an average developer, without a deep understanding of RDF, our template-based approach appears to be less verbose and simpler than RML-based solutions": according to my understanding, RML is not designed with that goal in mind. The goal of RML is rather to provide an exchange format for declarative mappings towards heterogeneous sources.

2) "to generate RDF-star, a user knowing MTL should only be able to write RDF-star, while a user knowing RML should learn RML-star": to be fair, one should also acknowledge that RML-star is just a slight extension of RML. Anyone mastering RML should not encounter particular problems in "learning" RML-star.

*** Reproducibility ***

The reproducibility of the work seems appropriate. The authors provide all necessary materials to repeat the experiments in a public GitHub repository, and the tools are distributed under an Apache 2.0 license.

*** Final Considerations ***

The abstract framework introduced by the authors and the tools presented are valuable contributions to the community. However, the following aspects require attention before the work can be considered for publication:

- Explicitly address the overlap with [A] and better articulate the novelty of the current submission.

- Revise the claims made in the abstract and conclusions regarding literature gaps to ensure consistency with the discussion in the main text.

- Provide a more detailed quantitative analysis, including validation of manually converted mappings, scalability experiments, and comparisons with additional tools.

- Improve the paper’s structure, formatting, and clarity to enhance readability.

With these revisions, the paper would make a stronger and more compelling contribution to the field.

[A] Mario Scrocca, Alessio Carenini, Marco Grassi, Marco Comerio, Irene Celino: Not Everybody Speaks RDF: Knowledge Conversion between Different Data Representations. KGCW@ESWC 2024

Review #3
By Ignacio Dominguez Martinez submitted on 28/Feb/2025
Suggestion:
Minor Revision
Review Comment:

The authors propose the mapping-template, a tool based on declarative mappings that enables semantic lifting of heterogeneous data into RDF and semantic lowering of RDF into other heterogeneous data formats.

The paper is clearly written and well presented. The authors provide a comprehensive state of the art on declarative mapping languages and related terminology, as well as a detailed workflow for knowledge conversion.

The paper includes a comparative table with the coverage of relevant aspects by the mapping language specifications identified in the state of the art.

It provides thorough qualitative and quantitative evaluations that prove the quality and performance of the tool. Additionally, the authors highlight the interoperability with the standard RML mapping language.

Finally, the mapping-template tool is made available on GitHub as an open-source implementation and it is well integrated with the Chimera framework for knowledge graph construction and data exchange.

-------

Remarks and Questions:

** Section 1 (Introduction)

- The motivation could be improved by elaborating on the reasons for implementing and adopting the mapping-template tool over other tools (e.g., engines based on languages like RML). The use cases listed in Section 7 (Adoption cases of the mapping-template tool) could help the reader see the value of the mapping tool based on its application in real scenarios in industrial and research projects.

** Section 3 (A workflow for declarative knowledge conversion):

- The blocks within the Mapping Scenario layer (from Figure 1) could be described in the text, with examples and detailing their interactions with the blocks of the Mapping Language layer.

- Table 1 references “Source/Target RML” but this should point to RML-IO (https://kg-construct.github.io/rml-io/ontology/documentation/index-en.html). In this sense, RML-IO covers the Data Source Specification in general (i.e., not only Web APIs and Streams but also, for example, SPARQL endpoints or files based on DCAT).

- Section 3.3 (Load). Suggest rephrasing to “The Data Sink Specification defines how to connect to the data sink (Data Sink Access)”.

** Section 5 (Qualitative Evaluation)

- The evaluation focuses on semantic lifting, while semantic lowering is only briefly mentioned with reference to a tutorial of Chimera. The examples provided in the documentation tackle semantic lifting from multiple formats to RDF, but there are no examples that show mappings from RDF to other data formats.

** Section 7 (Adoption cases of the mapping-template tool)

- A figure that shows the end-to-end data exchange would help the reader understand the general idea behind the mapping-template tool.

-------

Nits:

- p. 7: “As for the *Data Source Access*, different configurations may be specified.” should be “As for the *Data Sink Access*, different configurations may be specified.”

- p. 9: Broken link: https://github.com/cefriel/mapping- template/tree/feat- rml- compiler/rml

- p. 11: “The example rml-star considers a mapping scenario provided in the RML-star documentation with nested”, but the example in the repository is named “rdf-star”.

- p. 13: “The definition of iterators in case of scenarios associated with complex nested data (C2, cm-r14, cm-r27) is delegated”, should be “The definition of iterators in case of scenarios associated with complex nested data (Cm-r2, cm-r14, cm-r27) is delegated”