End-to-End Incremental Data Integration via Knowledge Graphs

Tracking #: 3138-4352

Javier Flores
Kashif Rabbani
Sergi Nadal
Cristina Gómez
Oscar Romero
Emmanuel Jamin
Stamatia Dasiopoulou

Responsible editor: 
Aidan Hogan

Submission type: 
Full Paper
Data integration, the task of providing a unified view over a set of data sources, is undoubtedly a major challenge for the knowledge graph community. Indeed, such a flexible data structure allows modeling the characteristics of source schemata, rich semantics for the global schema, and the mappings between them. Yet, the design of such data integration systems still entails an arduous manual task, which is aggravated when dealing with heterogeneous and evolving data sources. To overcome these issues, we propose a fully-fledged semi-automatic and incremental data integration approach. By considering all tasks that compose the end-to-end data integration workflow (i.e., bootstrapping, schema matching, schema integration, and generation of querying constructs), we are able to address them in a unified manner. We provide algorithms for each task, theoretically prove the correctness of our approach, and experimentally show its practical applicability.

Major Revision

Solicited Reviews:
Review #1
Anonymous submitted on 21/Jun/2022
Major Revision
Review Comment:

The paper presents a system to address the data integration task incrementally and semi-automatically. The system consists of the following components: bootstrapping, schema matching, and schema integration, which together deliver a schema mapping and a global schema as output.

It is interesting and practical to construct a system that can digest heterogeneous data sources incrementally and provide unified, inclusive access to all of them. The paper has a good presentation, and the authors explain the targeted data integration problem well. The logical flow of tasks to carry out the DI is well presented, and the work is appropriately motivated.

The proposed system seems a bit ad hoc, a soup of code; there are no adequate justifications for the methods deployed or proposed in each component. There is not enough argument for how the proposed parts go together compared to other alternatives.

The contribution of the paper is not clear. Although the authors did a pretty good job on the literature review, a more comparative study that distinguishes the proposed work from existing works and highlights the contributions and novelty of this work would be insightful.
While the authors provide a big picture of the data flow of their system in Figure 1, a clear overview of the different algorithms and intermediate data that carry out the tasks depicted in Figure 1 would be helpful. There is a body of work on semantic data integration based on knowledge graphs [1] that could be discussed as related work.

In Section 4, the proposed bootstrapping algorithm is presented, and the following statement is given regarding its input: 'The algorithm takes as input a structured or semi-structured data source and produces a graph-based representation of its schema.' However, among the contribution points, the authors claim the proposed bootstrapping algorithm can handle schemaless data sources. The assumption regarding the structuredness of the input data sources should be made explicit.

In Section 5, the incremental schema integration is proposed; however, the justification of the proposed rules is not convincing. It is unclear how the proposed method can handle schema matching and integration challenges, for example, cases in which different elements in the schemas represent the same concept. The problem the authors try to address in this section should be defined more explicitly and formally.

While the paper discusses schema-level integration via mapping and rule-based data transformation, there is no discussion of or attempt toward data-level integration, including redundancy elimination and cleaning.

In the experiments, the authors state their objectives as assessing bootstrapping and schema integration running time. However, a quantitative and qualitative evaluation of the integration process output, together with comparisons against alternatives to demonstrate the soundness of the proposed method, is needed.

1. Samaneh Jozashoori, Maria-Esther Vidal: MapSDI: A Scaled-Up Semantic Data Integration Framework for Knowledge Graph Creation. OTM Conferences 2019: 58-75

Review #2
By David Chaves-Fraga submitted on 22/Jul/2022
Major Revision
Review Comment:

This paper presents a framework for semi-automatic and incremental data integration using KGs. The authors present a new approach composed of several tasks (bootstrapping, schema matching, schema integration, etc.). The main contribution of the paper is schema creation and matching from heterogeneous data sources: tabular formats without a schema (e.g., CSV) and tree-based formats (e.g., JSON and XML), merged into an integrated vocabulary that may evolve depending on changes in the source schemas, but without targeting any ontology. Hence, the work is more aligned with schema generation/integration, which is only one part of a data integration process (i.e., as far as I understood, the proposal does not actually transform or integrate data).

The contributions of the paper are very interesting and, in general, well presented. I really liked the figures with the colors, as they help the reader easily understand the steps performed by the proposal. As it is submitted as a full paper, I will evaluate it against the requirements recommended by the journal (originality, significance of the results, and quality of writing), together with its strong and weak points.

The proposal relies on a set of rules that are applied during the construction of the integrated schema. To the best of my knowledge, it is one of the first approaches that goes beyond relational databases (where approaches take advantage of the schema and constraints to integrate the schemas) and provides a feasible approach to schema integration of heterogeneous sources. However, I was expecting a deeper analysis and demonstration, from a theoretical perspective, of why having an RDB differs from having schemaless input sources; I would like to see a formalization of what exactly the problem being addressed is, and what its impact on a data integration system is.

As the contributions are more aligned with schema integration, I did not understand very well why mapping rules are included in the related work and mentioned several times. Mappings focus on providing a set of rules for transforming data instances into a graph that must follow a target vocabulary (e.g., an ontology) (see the formal framework defined in [1]), and the integration of schemas is done manually. Hence, I do not see the relationship between these approaches and solutions for creating mapping rules ("Supported data sources" section). Instead, I would put more focus on analyzing the (semi-)automatic generation of mapping rules from the schema matching perspective (MIRROR, D2R, etc.). I am also missing comparisons against many systems and solutions such as AutoMap4OBDA [2] (and, in general, Sicilia's thesis dissertation [3]), which is close to ontology learning methods, or the recent Facade-X [4,5] proposal, which creates an intermediate representation in RDF and supports several data formats (beyond CSV and JSON), see: https://github.com/SPARQL-Anything/sparql.anything. SemTab approaches (https://www.cs.ox.ac.uk/isg/challenges/sem-tab/) may also be interesting to analyze, as they automatically perform schema matching between KGs and tabular sources.

What exactly are the requirements for inclusion among the end-to-end DI systems? Is Squerall included but not Ontario [6] because the engine provides an external UI to create mapping rules? In the end, the authors do not know whether Squerall integrates schemas or not. Why are other engines such as Morph-RDB not included?

I would like to see more details about the running example or use cases where this approach fits, to better motivate the necessity of such approaches. My concern here is that, from my personal experience, most projects and KG construction processes require an ontology to be compliant with, so I do not know how this solution can help a general OBDA integration pipeline. In the end, the proposed approach needs user-provided labels; what is the impact of this task in comparison to creating the mapping rules?

(Significance of the results)
The evaluation presents three different experiments: bootstrapping, schema integration, and real data, and in my opinion this is the weakest part of the paper. The creation of datasets for validating the approach is a bit naïve, and the experiments are very simple. I would encourage the authors to define a clear set of research questions that they want to answer, and to prepare the experiments accordingly. The approach is not evaluated against any other solution from the state of the art (I understand that this might be difficult, but at least provide a justification). Is it not possible to evaluate the approach against OAEI solutions? In addition, I would also expect a more detailed discussion of the results obtained, rather than only mentioning what the actual results are. What is the overall final impact of the different steps? Which are the steps that demand more time, and why? I do not think that justifying time impacts because Jena is used is enough for a research paper. Finally, and more aligned with usability, if the output of the process is an integrated vocabulary, how is it presented to the final user to be used? How are the mappings constructed to generate the RDF graph compliant with the integrated vocabulary? These are questions that came up while I was reading the evaluation section.

It is important to mention that, in terms of reproducibility, the code and data used are available on GitHub, and the authors also provide guidelines and information on how to run the solution over the different experiments. However, no license is provided, so I do not know whether the code can be reused or extended in the future.

(Quality of writing)
I do not have specific comments on this topic. The paper is well written and, at least for me, was easy to follow, as the authors use the same example for all the steps, which facilitates comprehension. As I already mentioned, I really liked the way the examples are shown (with colors and references), which demonstrates the effort the authors made to create an accessible resource. There are some typos that should be reviewed, and also rules that perhaps need not be defined (e.g., rule I2 and rule I3 are exactly the same). The structure is correct and all figures enhance the readability of the paper.

My recommendation for this paper is a major revision considering the following relevant changes: i) enhance the motivation for the necessity of an approach like this, ii) extend and analyze in depth the state of the art, iii) revise the evaluation with research questions, a discussion of the results beyond numbers, and a comparison against other solutions.

[1] Xiao, G., Calvanese, D., Kontchakov, R., Lembo, D., Poggi, A., Rosati, R., & Zakharyaschev, M. (2018). Ontology-based data access: A survey. International Joint Conferences on Artificial Intelligence.
[2] Sicilia, Á., & Nemirovski, G. (2016, November). AutoMap4OBDA: Automated generation of R2RML mappings for OBDA. In European Knowledge Acquisition Workshop (pp. 577-592). Springer, Cham.
[3] Sicília, A. (2016). Supporting Tools for Automated Generation and Visual Editing of Relational-to-Ontology Mappings (Doctoral dissertation, Universitat Ramon Llull).
[4] Daga, E., Asprino, L., Mulholland, P., & Gangemi, A. (2021). Facade-X: an opinionated approach to SPARQL anything. Studies on the Semantic Web, 53, 58-73.
[5] Preprint, but already accepted: https://sparql.xyz/FacadeX_TOIT.pdf
[6] Endris, K. M., Rohde, P. D., Vidal, M. E., & Auer, S. (2019, August). Ontario: Federated query processing against a semantic data lake. In International Conference on Database and Expert Systems Applications (pp. 379-395). Springer, Cham.

Review #3
By Andriy Nikolov submitted on 04/Aug/2022
Major Revision
Review Comment:

The paper deals with the problem of semantic data integration, particularly focusing on integrating naturally semi-structured data sources like JSON, XML, or CSV documents. The authors consider a generic data integration framework supporting incremental data integration and present a method implementing the first stages of the process: schema bootstrapping from source documents and resolving schema-level representation differences relying on existing schema alignments. The bootstrapping algorithm first extracts the source documents' models using the corresponding document-specific meta-model representation in RDFS, and then transforms them into RDFS domain models. These RDFS schemata then get integrated into a common schema by exploiting the schema alignments (the paper assumes that the alignments are already provided by some input process).

The topic of data integration of multiple sources is a relevant and important one. The use of RDF(S) graph structure naturally makes incremental integration easier as it involves less revision of hard constraints.
The focus on automated integration of semi-structured documents, particularly defining an umbrella schema for tree-structured documents, is interesting, as data integration techniques have mainly focused on more structured representation formats, like relational databases. The procedure described in the paper involves a set of deterministic transformation rules to go from the low-level data representation to an integrated schema that would enable transparent querying over the range of data sources. The automated integration procedure can in principle reduce the effort required on the part of the domain expert users.

However, the added research value of the whole approach could be motivated more. Particularly, it does not appear fully clear whether the described processes constitute a research problem as opposed to a purely engineering one. The proposed procedure seems to mainly involve applying a sequence of pre-defined rules to process the “mainstream” scenario, while leaving the most challenging problems outside of its scope and assuming that all important information will be received as input: e.g., how to obtain the alignments; how to handle overlapping/conflicting alignments or alignments with different granularities; how to repair inconsistencies; how to generate, model, and process more complex correspondences than 1-to-1 mappings (e.g., like the cases mentioned in Section 5.3), etc.

The evaluation experiments described in Section 6 also do not seem to showcase the added value of the method. The tests measure the runtime performance of the algorithm. However, it is not the runtime of applying schema alignments that presents problems in data integration. One can argue that this runtime performance is not really important at all, since it is a one-time offline procedure: the described algorithm is fully deterministic, the reported runtimes are very low, and, even if they were not, the application of rules could be optimized a lot via parallelization anyway. Thus, it should be motivated why this way of evaluation was chosen in the first place.

Finally, the system design description does not look complete without describing the query processing part. How would the chosen integrated schema representation be used to facilitate query execution? What kind of query translation rules will be required? How will joins be executed to optimize the performance over a set of semi-structured data sources? What kinds of indexes/caches would need to be maintained? Without considering these, the system design appears incomplete and it is not clear that the schema integration choices can be always made completely independently from the subsequent query execution requirements.

So, to summarize, in my view, the paper could be revised in order to be more convincing, in particular:
1. It should be clearly motivated what aspects of the workflow present research problems rather than purely engineering ones and the research contribution should be emphasized.
2. The evaluation should show how the proposed approach helps to solve these research problems more efficiently.
3. Separation of schema integration and query execution problems does not appear fully convincing. Perhaps, it would make more sense to describe the whole system design in one paper, as now it appears incomplete. Otherwise, there should be a discussion on how the proposed schema integration can be considered independently and does not impact query processing.
4. Alternatively, if the focus of work is on the system design, rather than on research problems, perhaps, the paper should be restructured as a system paper?

A minor comment: shouldn't the rdfs:member property be reversed in Fig. 6 and Fig. 8, with the container as the subject and the element as the object? (I think this is how it is defined in the RDFS specification, https://www.w3.org/TR/rdf-schema/#ch_member, if I am not mistaken.)
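To make the membership direction concrete, here is a minimal sketch using plain Python tuples as RDF triples (the node names ex:bag and ex:item are hypothetical, chosen for illustration): per the RDFS specification, the container is the subject of rdfs:member (just as with rdf:_1, rdf:_2, ...) and the contained element is the object.

```python
# Sketch of RDFS container membership, with plain tuples standing in
# for RDF triples. Node names (ex:bag, ex:item) are hypothetical.
RDFS_MEMBER = "rdfs:member"

# Per the RDFS spec, the container is the SUBJECT of rdfs:member
# and the contained element is the OBJECT.
graph = {
    ("ex:bag", "rdf:type", "rdf:Bag"),
    ("ex:bag", RDFS_MEMBER, "ex:item"),  # container -> element
}

def members(triples, container):
    """Collect the elements of a container by following rdfs:member."""
    return {o for (s, p, o) in triples if s == container and p == RDFS_MEMBER}

print(members(graph, "ex:bag"))  # {'ex:item'}
```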