Empowering the SDM-RDFizer Tool for Scaling Up to Complex Knowledge Graph Creation Pipelines

Tracking #: 3246-4460

Enrique Iglesias
Maria-Esther Vidal
Samaneh Jozashoori
Diego Collarana
David Chaves-Fraga

Responsible editor: Guest Editors Tools Systems 2022

Submission type: Tool/System Report
Data has grown exponentially in the last years, and knowledge graphs have gained momentum as data structures to integrate heterogeneous data and metadata. This explosion of data has created many opportunities to develop innovative technologies. Still, it brings attention to the lack of standardization for making data available, raising questions about interoperability and data quality. Data complexities such as large volume, heterogeneity, and high duplicate rates affect knowledge graph creation. This work addresses these issues to scale up knowledge graph creation guided by the RDF Mapping Language (RML). For that purpose, we present the SDM-RDFizer, a two-fold solution to address these two sources of complexity. First, RML triples maps are reordered in a way that the most selective maps are evaluated first, while non-selective rules are considered at the end, reducing the number of triples that are kept in the main memory. In the second step, an RDF compression strategy and novel operators are implemented to avoid the generation of duplicated RDF triples and the reduction of the number of comparisons during the execution of RML operators between mapping rules. We test our tool on two well-known benchmarks, overcoming state-of-the-art RML engines, and hence, demonstrating the benefits of the proposed techniques.

Major Revision

Solicited Reviews:
Review #1
Anonymous submitted on 26/Sep/2022
Major Revision
Review Comment:

The paper presents an incremental improvement of the SDM-RDFizer tool. The paper provides a good level of technical (implementation) detail, but the presentation should be improved: several points are currently unclear, and in particular the methodology description, the relation to the state of the art, and the relevance to user communities should be addressed.
An active GitHub repository with the tool is available.

Detailed comments are as follows:
- The abstract mentions many challenges, but some facts and challenges are formulated imprecisely (e.g. “explosion of data”; obviously “explosion” is not meant literally here). It is also unclear which challenges are relevant and why (is it integration, knowledge graph creation, or something else?) and who the expected audience of the paper is (it is currently unclear from the abstract why, and to whom, the paper should be of interest). The abstract should clearly communicate which contribution the work delivers and which challenges it explicitly solves.
- Section 1: In this section it would be useful to include a typical scenario example, where the developed features are needed; “Iglesias et al.[12]” should be “Iglesias et al. [12]”.
- Section 2 should have an introduction to its subsections in the beginning (explaining what they are about and why these),
- The concept of “KG creation pipeline” should be explained explicitly before it is used – it would also add value to indicate which challenges within it are addressed by this work,
- In motivation, it would be beneficial to mention for which specific use cases/ practical scenarios the work is useful,
- Section 2.3: it is unclear from where the requirements arise, and how they are connected with presented research, more context is needed in the text,
- Some terminology is at times used unconventionally. For example: “virtual knowledge graph creation process (formerly known as ontology-based data access)” – knowledge graph creation is certainly very different from ontology-based data access in general – the terminology should be used more accurately!
- In related work, one should not only describe any relevant things that exist, but also indicate where shortcomings are, or which gaps the solution presented here addresses,
- Methodology behind the work needs to be explained,
- The structure of sub-section 4.3 needs to be explained,
- At times informal expressions are used e.g. “thanks to the fact” – the language should be more formal,
- The last sentence of the 1st paragraph of section 5 is not complete (does not finish with a full stop) – it should continue with an explanation of how the following subsections are organized; the evaluation of the tool’s new features appears to be conducted mainly against the tool itself – it is unclear whether this is an appropriate way to evaluate the development or whether something is missing.
- Section 6: it would be useful to not only list in which projects the tool is used in general, but also in which of these use cases the developed new features are relevant, and which impact the new developments have on the use cases – otherwise the information is not very useful in the context of the paper (and is mostly name-dropping) – and also there is little value in terms of other similar use cases deciding on the suitability of the tool; “spanish national project” should be spelled as “Spanish national project” and “Spanish Cities” should be spelled as “Spanish cities”.
- The overview of the literature does not cover how the addressed challenges are approached in practice by users/developers (in user communities, e.g. life sciences, e-Commerce, and other domains where such solutions may be needed); a large share of the cited papers are self-citations from the author team; references 11 and 30 are identical.

Review #2
By Dominik Tomaszuk submitted on 29/Dec/2022
Minor Revision
Review Comment:

The paper introduces data management techniques that use novel data structures and physical operators to execute RML triples maps efficiently. These techniques have been implemented in SDM-RDFizer v4.5.6. The paper is an extension of a work published at the CIKM 2020 conference. The authors propose further features: data structures for RDF data compression and physical operators. The paper is written quite well. However, I encourage you to improve the readability and work on the layout of the paper. The documentation for the tool is excellent. I am impressed with the video tutorial. Docker has been prepared, which makes it much easier to use the tool.

# Notes, comments, and questions

1. ex:Gene, ex:Sample, and ex:Tumor should be in monospace
2. "ID_sample" in "dataSource1" and "dataSource2": do we need quotation marks there? Is monospace font not enough?
3. Despite a relatively long section on motivations, it is still not entirely clear what motivated the authors of the paper
4. "Based on the parameters defined by Chaves-Fraga et al., we elucidate the requirement": It would be easier for readers if a reference number appeared in these types of sentences.
5. It isn't easy to understand what contributions the new release brings. This information should be at the beginning of the paper.
6. in the last years -> in the last few years
7. such as large volume -> such as large volumes
8. work reported in Iglesias et al. -> work reported by Iglesias et al.
9. The title of the Motivation and Requirements section is misleading
10. Related Work is inferior. Many formats, tools, and languages are missing. For example:
10.1 XML:
10.1.1 XSPARQL (Bischof, Stefan, Nuno Lopes, and Axel Polleres. "Improve Efficiency of Mapping Data between XML and RDF with XSPARQL." International Conference on Web Reasoning and Rule Systems. Springer, Berlin, Heidelberg, 2011.),
10.1.2 Gloze (Battle, Steve. "Gloze: XML to RDF and back again." Jena user conference. Vol. 6. 2006.),
10.2 CSV:
10.2.1 Tarql (Cyganiak, Richard. Tarql (sparql for tables): Turn csv into rdf using sparql syntax. Technical report, 2015.),
10.2.2 CSVW/csv2rdf (Herman, Ivan, Gregg Kellogg, Jeremy Tandy. "Generating RDF from Tabular Data on the Web". W3C Recommendation, 2015),
10.3 RDB:
10.3.1 AuReLi (Polfliet, Simeon, and Ryutaro Ichise. "Automated mapping generation for converting databases into linked data." Proc. 9th International Semantic Web Conference (ISWC2010). 2010.),
10.3.2 Triplify (Auer, Sören, et al. "Triplify: light-weight linked data publication from relational databases." Proceedings of the 18th international conference on World wide web. 2009.)
11. What was the disk speed of the experiment computer? I recommend adding the speed. You can use sudo hdparm -tT /dev/sdX for this.
12. The link https://cancer.sanger.ac.uk/cosmicGRCh37,version90,releasedAugust2019 is broken. I guess it should be https://cancer.sanger.ac.uk/cosmic
13. What about RDF-star support?
14. SDM-Genomics-Dataset is not publicly available. Please change private mode to public mode on Figshare and use the DOI in the paper (the authors currently use a GitHub URL).

Review #3
By Antoine Zimmermann submitted on 13/Mar/2023
Minor Revision
Review Comment:

The paper describes a system that implements the RDF Mapping Language specification in a way that optimises its scalability, better handles duplicates, lowers memory usage, with special attention to processing complex mappings.
The motivation in Sec.2 explains key issues with performant mapping execution and what's required for a KG creation pipeline. The state of the art of the field is covered in Sec.3 with highlight of some of their limitations. Sec.4 describes the proper contribution, namely SDM-RDFizer in its latest version, showing its architectural components and some of the key algorithms used for optimisation. Sec.5 provides experimental details and results. Sec.6 finally analyses the characteristics of the tool that makes it worthwhile.

Main remarks:
As a scientific paper, the work is well described, the structure is clear, the state of the art is well covered, experiments are based on a well known benchmark and the conclusions are consistent with the results shown. The technical description of the tool components and algorithms is a little hard to follow, but with careful attention it is clear enough. The final analysis provides evidence that the work is worth publishing as a "system paper" at SWJ.

As a tool, the files provided are available at Zenodo, which should make them stable. They are well organised, with clear documentation. Installing the tool is straightforward. However, when I executed it with the example files, I got an error saying "FileNotFoundError: [Errno 2] No such file or directory: './files/sampleSource1.csv'" because the "files" directory was not in the current directory but inside the "main_directory" provided in the config.ini file. This should be easy to fix and did not prevent me from being able to test the tool.

However, I only tested the tool on the sample mappings in the data files, which are trivial. The resource files themselves do not describe how to reproduce the experiments of the paper, but this can be looked up in a separate GitHub repository. Reproducing the experiments completely requires a number of operations and manipulations that are not straightforward. Reproducibility relies on the existence and maintenance of the two benchmarks, which are independent of this paper. This does not provide a perfect guarantee of reproducibility in the long term but is acceptable in the medium term, and exact reproduction of these experiments in the distant future may be irrelevant anyway.

Minor comments:
- Intro:
* SPARQL Generate: the "official" spelling is "SPARQL-Generate" with a dash
* last paragraph: "Section[new line]3" -> make space non-breaking "Section~3"

- Sec.2:
* 2.1: "into an RDF knowledge graph data" -> "into an RDF graph"
* 2.2: "referencing to" -> "referring to"
* Fig.2: instead of putting "Time (secs)" at the top of the table, there could be explicit durations in the cells with, e.g. "11,961.81 s". "Timeout" is not a duration in seconds, so this would avoid a semantic mismatch (also, at first, I did not notice the numbers were in seconds)

- Sec.3:
* 3.1: "virtual knowledge graph creation process (formerly known as ontology-based data access)" -> I disagree with this phrasing: OBDA is a different concept than virtual KG creation process. OBDA involves different things and techniques that may or may not include virtual KG creation (although it is typically part of it)
* 3.1 "MorphCSV [8] propose" -> proposes

- Sec.4:
* 4.3.1: "a hash table where the key is the RDF resource" -> the key is the IRI or literal, not the resource itself (which could be a physical object or an abstract thing)
* 4.3.1: "A PPT" -> "A PTT"
* Algorithm 1, line 3 and line 4; Algo 2, line 3 and 4; Algo 3, line 6 and 7: "PPT" -> "PTT"

- Sec.5:
* 5.2: "The Figure 7a" -> "Figure 7a"

- References:
* Ref 5 and 20 are for the same paper.
* The level of detail of the references varies considerably from one reference to another. This should be homogenised, and some references should be completed