Incremental Schema Integration for Data Wrangling via Knowledge Graphs

Tracking #: 3286-4500

Javier Flores
Kashif Rabbani
Sergi Nadal
Cristina Gómez
Oscar Romero
Emmanuel Jamin
Stamatia Dasiopoulou

Responsible editor: 
Aidan Hogan

Submission type: 
Full Paper
Virtual data integration is the current go-to approach for data wrangling in data-driven decision-making. In this paper, we focus on automating schema integration, which extracts a homogenised representation of the data source schemata and integrates them into a global schema to enable virtual data integration. Schema integration requires a set of well-known constructs: the data source schemata and wrappers, a global integrated schema and the mappings between them. Based on them, virtual data integration systems enable fast and on-demand data exploration via query rewriting. Unfortunately, the generation of such constructs is currently performed in a largely manual manner, hindering its feasibility in real scenarios. The problem is aggravated when dealing with heterogeneous and evolving data sources. To overcome these issues, we propose a fully-fledged semi-automatic and incremental approach grounded on knowledge graphs to generate the required schema integration constructs in four main steps: bootstrapping, schema matching, schema integration, and generation of system-specific constructs. We also present NextiaDI, a tool implementing our approach. Finally, a comprehensive evaluation is presented to scrutinize our approach.

Minor Revision

Solicited Reviews:
Review #1
Anonymous submitted on 30/Nov/2022
Review Comment:

I want to thank the authors for their endeavors and clear responses to my comments (R1). The revised version of the paper has been enhanced over the prior version, so I recommend the manuscript for publication.

Review #2
By Andriy Nikolov submitted on 04/Dec/2022
Minor Revision
Review Comment:

This manuscript was submitted as 'full paper' and should be reviewed along the usual dimensions for research contributions which include (1) originality, (2) significance of the results, and (3) quality of writing. Please also assess the data file provided by the authors under “Long-term stable URL for resources”. In particular, assess (A) whether the data file is well organized and in particular contains a README file which makes it easy for you to assess the data, (B) whether the provided resources appear to be complete for replication of experiments, and if not, why, (C) whether the chosen repository, if it is not GitHub, Figshare or Zenodo, is appropriate for long-term repository discoverability, and (D) whether the provided data artifacts are complete. Please refer to the reviewer instructions and the FAQ for further information.

The paper has been revised substantially in several respects, most importantly:
- It now clearly states that its scope is limited to the schema integration part of the process rather than the whole data integration problem, thus excluding aspects such as query processing, which would otherwise have to be discussed.
- The user evaluation section was added to demonstrate the added value of the system from the perspective of practitioners.
In this way, I think, it covers most of my comments from the original review, either by resolving them or by delineating the intended scope. From my point of view, two remaining aspects are:
- Now that the scope of the paper has been reduced, the question remains whether the added value for the schema integration part alone constitutes a sufficient contribution. In my view, the added user evaluation section supports the claim and provides sufficient evidence for it, but this is something that might be considered further.
- The paper could benefit from another proofreading to fix some writing style issues and typos.
Some typos I noticed:
p. 2, line 21: “Thus, allowing fast and on-demand data exploration.” -> incomplete sentence
p. 2, line 23: “As result” -> “As a result”
p. 3, line 32: “There,” -> “There”
p. 9, lines 6-7: “Is candidate” -> “is a candidate”
p. 24, lines 28-29: “Thus, providing an intuitive user interface to use Nextia_DI functionalities.” -> incomplete sentence

Review #3
By David Chaves-Fraga submitted on 03/Jan/2023
Minor Revision
Review Comment:

First of all, I would like to thank the authors for the effort in providing a very detailed answer to all my comments and improving the paper considerably. It is now better motivated, provides a good overview of the state of the art and the evaluation has been extended with a user study that supports the proposed approach.

Two final comments that should be addressed:
1) I still don’t understand why Squerall is included in the state of the art. Maybe I’m missing something, but from what I understood, it is an RML engine that does mostly the same tasks as Ontario, Ontop, or Morph-RDB. If there is any other task that the engine is able to do (more related to what is proposed in the paper), please clarify it; if not, I would recommend removing it.

2) Ontop does not parse RML mappings. Declarative mapping rules such as R2RML (W3C recommendation), or RML (its main extension), are independent of the engine. If Algorithm 10 is able to generate RML (or R2RML) mappings, I would understand that the instance-level integration can be performed by any [R2]RML-compliant engine (virtual or materialized). I would recommend generalizing Section 6.2 and presenting it in terms of the generation of declarative mapping rules. An up-to-date list of RML and R2RML engines can be found in the following links: