Empowering the SDM-RDFizer Tool for Scaling Up to Complex Knowledge Graph Creation Pipelines

Tracking #: 3489-4703

Authors: 
Enrique Iglesias
Maria-Esther Vidal
Samaneh Jozashoori
Diego Collarana
David Chaves-Fraga

Responsible editor: 
Guest Editors Tools Systems 2022

Submission type: 
Tool/System Report
Abstract: 
The significant increase in data volume in recent years has prompted the adoption of knowledge graphs as valuable data structures for integrating diverse data and metadata. However, this surge in data availability has brought to light challenges related to standardization, interoperability, and data quality. Knowledge graph creation faces complexities arising from factors such as large data volumes, data heterogeneity, and high duplicate rates. This work addresses these challenges by focusing on scaling up declarative knowledge graph creation specified using the RDF Mapping Language (RML). We propose SDM-RDFizer, a two-fold solution designed to address these complexities. Firstly, we introduce a reordering approach for RML triples maps, prioritizing the evaluation of the most selective maps first to reduce memory consumption. Secondly, we employ an RDF compression strategy, along with optimized data structures and novel operators, to prevent the generation of duplicate RDF triples and to optimize the execution of RML operators. We evaluate the effectiveness of SDM-RDFizer using established benchmarks, which demonstrate its superiority over existing state-of-the-art RML engines, highlighting the tangible benefits of our proposed techniques. Furthermore, the paper presents real-world projects in which SDM-RDFizer has been utilized, providing insights into the advantages of declaratively defining knowledge graphs and efficiently executing these specifications using SDM-RDFizer.
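
To make the two ideas in the abstract concrete, the following sketch (in Python, the language SDM-RDFizer is implemented in) illustrates how triples maps could be reordered by an estimated selectivity before execution and how a hash-based index can suppress duplicate RDF triples before serialization. The TriplesMap structure, the selectivity estimate, and the function names are illustrative assumptions made for this sketch, not the engine's actual internals or API.

    from dataclasses import dataclass

    # Hypothetical, simplified view of an RML triples map used only for this sketch;
    # the real engine operates on parsed RML mapping documents and logical sources.
    @dataclass
    class TriplesMap:
        name: str
        rows: list        # records of the logical source, already loaded
        project: object   # callable mapping a row to a (subject, predicate, object) triple

        def estimated_selectivity(self) -> float:
            # Crude estimate: ratio of distinct projected triples to input rows.
            # A lower value means the map is more selective.
            distinct = {self.project(r) for r in self.rows}
            return len(distinct) / max(len(self.rows), 1)

    def materialize(triples_maps):
        """Evaluate the most selective triples maps first and avoid emitting duplicates."""
        seen = set()      # hash-based structure standing in for the engine's internal indexes
        output = []
        for tm in sorted(triples_maps, key=lambda t: t.estimated_selectivity()):
            for row in tm.rows:
                triple = tm.project(row)
                if triple not in seen:   # duplicate suppression before serialization
                    seen.add(triple)
                    output.append(triple)
        return output

The ordering and the selectivity estimate are deliberately schematic; the point is only to show where reordering and duplicate suppression sit in a materialization loop, not how SDM-RDFizer implements them.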
Tags: 
Reviewed

Decision/Status: 
Minor Revision

Solicited Reviews:
Review #1
By Dominik Tomaszuk submitted on 08/Jul/2023
Suggestion:
Minor Revision
Review Comment:

I would like to express my gratitude to the authors for addressing my comments and suggestions. I am pleased with the revisions made in the new version. As the authors emphasized, this is primarily a tool report paper, and taking that into consideration, I am now inclined to accept it. However, I have noticed that some of my comments regarding the Related Work section have been overlooked or dismissed by the authors without any explanation. I still believe that the Related Work could be improved, and it strikes me as odd that the authors did not acknowledge or incorporate several suggestions that were provided.

Review #2
By Antoine Zimmermann submitted on 13/Jul/2023
Suggestion:
Accept
Review Comment:

The new version of the paper addresses the concerns I had before. This is an improved version; although I still find the technical development a bit hard to follow, I don't see anything that is truly objectionable.

Review #3
Anonymous submitted on 30/Jul/2023
Suggestion:
Major Revision
Review Comment:

Unfortunately, the paper still reads like a report on incremental tool development, not as a self-contained, reflective, and complete research paper.

The quality of the paper has not really improved since the last submission.

There is no consistency in the description of the challenges, and the methodology is not clear. The requirements are generic, and it is not clear from where they were derived. It is also not clear which use cases and scenarios need this work. In the introduction, an ESWC 2023 event benchmark is mentioned, while the next section already uses the "EU H2020 funded iASiS project" as a basis.
Who really needs this tool and what for?
The research questions themselves are mentioned for the first time in Section 5 (!). What were the authors doing till then - hacking?

Essential details are lacking in the description of the approach. Why, for example, was RML chosen in the first place?
Or: "Three state-of-the-art RML engines, i.e., RMLMapper v6.08, Morph-KGC v2.1.19, and SDM-RDFizer v3.2, are utilized to create this portion of the KG; the engines timed out in five hours." -> In which environments, and on which machines and with which technical settings, were the experiments run?

There is no clear comparison to the state of the art in light of the research questions. Instead, there are unsubstantiated (sometimes self-praising) statements, e.g., "Since its first release, the SDM-RDFizer has caught the attention of practitioners and knowledge engineers due to its good results w.r.t. other RML engines." -> What are "good results" - where and on what? Reference? "has caught" is also another example of informal language use; see also below.

The fact that the tool was "sold" in several EU projects does not mean that much. One cannot just name-drop project names without a detailed explanation of what exactly the tool improved there and how. And no such details are provided here (nor for the evaluation of the problems, which are also not clearly described).

The text should be improved.
For example, KGs, DISs, and RML are introduced in the paper multiple times. Acronyms should be introduced once and then used throughout the rest of the paper, not introduced over and over again.
There are sentences with unclear grammar, e.g., this one right in the abstract: "We evaluate the effectiveness of SDM-RDFizer using established benchmarks, which demonstrate its superiority over existing state-of-the-art RML engines, highlighting the tangible benefits of our proposed techniques." -> What is the subject of "demonstrate" here - the benchmarks?
Title of Section 4: why does the word "tool" not start with a capital letter?