Review Comment:
This paper presents a framework for semi-automatic and incremental data integration using KGs. The authors present a new approach composed of several tasks (bootstrapping, schema matching, schema integration, etc.). The main contribution of the paper is schema creation and matching from heterogeneous data sources: tabular formats without a schema (e.g., CSV) and tree-based formats (e.g., JSON and XML), merging them into an integrated vocabulary that may evolve with changes in the source schemas, but without involving any ontology. Hence, the work is more aligned with schema generation/integration, which is only one part of a data integration process (i.e., as far as I understood, the proposal does not actually transform or integrate data).
The contributions of the paper are very interesting and, in general, well presented. I really liked the figures with the colors, as they help the reader easily understand the steps performed by the proposal. As it is submitted as a full paper, I will evaluate it against the requirements recommended by the journal (originality, significance of the results, and quality of writing), noting strong and weak points for each.
(Originality)
The proposal relies on a set of rules that are applied during the construction of the integrated schema. To the best of my knowledge, it is one of the first approaches that goes beyond relational databases (where approaches take advantage of the schema and constraints to integrate the schemas) and provides a feasible approach to schema integration of heterogeneous sources. However, I was expecting a deeper analysis and demonstration, from a theoretical perspective, of why having an RDB differs from having schemaless input sources. I would therefore like to see a formalization of what exactly the addressed problem is, and what its impact on a data integration system is.
As the contributions are more aligned with schema integration, I did not understand very well why mapping rules are included in the related work and mentioned several times. Mappings focus on providing a set of rules for transforming data instances into a graph that must follow a target vocabulary (e.g., an ontology) (see the formal framework defined in [1]), and the integration of schemas is done manually. Hence, I do not see the relationship between these approaches and solutions for creating mapping rules ("Supported data sources" section). Instead, I would put more focus on analyzing the (semi-)automatic generation of mapping rules from the schema matching perspective (MIRROR, D2R, etc.). I am also missing a comparison against many systems and solutions, such as AutoMap4OBDA [2] (and, in general, Sicilia's doctoral dissertation [3]), which is close to ontology learning methods, or the recent Facade-X proposal [4,5], which creates an intermediate representation in RDF and supports several data formats (beyond CSV and JSON); see: https://github.com/SPARQL-Anything/sparql.anything. SemTab approaches (https://www.cs.ox.ac.uk/isg/challenges/sem-tab/) may also be interesting to analyze, as they automatically perform schema matching between KGs and tabular sources.
What exactly are the requirements for including the end-to-end DI systems? Is Squerall included but not Ontario [6] because the engine provides an external UI to create mapping rules? In the end, the authors do not know whether Squerall integrates schemas or not. Why are other engines such as Morph-RDB not included?
I would like to see more details about the running example or use cases where this approach fits, to better motivate the necessity of such approaches. My concern here is that, in my personal experience, most projects and KG construction processes require an ontology to be compliant with, so I do not know how this solution can help a general OBDA integration pipeline. In the end, the proposed approach needs user-provided labels: what is the cost of this task in comparison to creating the mapping rules?
(Significance of the results)
The evaluation presents three different experiments: bootstrapping, schema integration, and real data, and IMO this is the weakest part of the paper. The creation of datasets for validating the approach is a bit naïve and the experiments are very simple. I would encourage the authors to define a clear set of research questions that they want to answer, and to design the experiments accordingly. The approach is not evaluated against any other solution from the state of the art (I understand that this might be difficult, but at least a justification should be provided). Would it not be possible to evaluate against OAEI matching systems, for instance? In addition, I would also expect a more detailed discussion of the results obtained, beyond merely reporting the numbers. What is the overall final impact of the different steps? Which steps demand more time, and why? I do not think that justifying time costs by the fact that Jena is used is enough for a research paper. Finally, and more aligned with usability: if the output of the process is an integrated vocabulary, how is it presented to the final user? How are the mappings constructed to generate an RDF graph compliant with the integrated vocabulary? These are questions that came up while I was reading the evaluation section.
It is important to mention that, in terms of reproducibility, the code and data used are available on GitHub, and the authors also provide guidelines and information on how to run the solution over the different experiments. However, no license is provided, so I do not know whether the code can be reused or extended in the future.
(Quality of writing)
I do not have many specific comments on this topic. The paper is well written and, at least for me, was easy to follow, as the authors use the same example throughout all the steps, which facilitates comprehension. As I already mentioned, I really liked the way the examples are presented (with colors and references); it demonstrates the effort the authors made to create an accessible resource. There are some typos that should be reviewed, and also rules that perhaps do not need to be defined separately (e.g., rule I2 and rule I3 are exactly the same). The structure is correct and all figures enhance the readability of the paper.
My recommendation for this paper is a major revision considering the following relevant changes: i) strengthen the motivation for the necessity of an approach like this; ii) extend and analyze in depth the state of the art; iii) restructure the evaluation around research questions, with a discussion of the results beyond raw numbers and a comparison against other solutions.
[1] Xiao, G., Calvanese, D., Kontchakov, R., Lembo, D., Poggi, A., Rosati, R., & Zakharyaschev, M. (2018). Ontology-based data access: A survey. International Joint Conferences on Artificial Intelligence.
[2] Sicilia, Á., & Nemirovski, G. (2016, November). AutoMap4OBDA: Automated generation of R2RML mappings for OBDA. In European Knowledge Acquisition Workshop (pp. 577-592). Springer, Cham.
[3] Sicilia, Á. (2016). Supporting Tools for Automated Generation and Visual Editing of Relational-to-Ontology Mappings (Doctoral dissertation, Universitat Ramon Llull).
[4] Daga, E., Asprino, L., Mulholland, P., & Gangemi, A. (2021). Facade-X: an opinionated approach to SPARQL anything. Studies on the Semantic Web, 53, 58-73.
[5] Preprint, already accepted for publication: https://sparql.xyz/FacadeX_TOIT.pdf
[6] Endris, K. M., Rohde, P. D., Vidal, M. E., & Auer, S. (2019, August). Ontario: Federated query processing against a semantic data lake. In International Conference on Database and Expert Systems Applications (pp. 379-395). Springer, Cham.