Incremental Knowledge Graph Construction from Heterogeneous Data Sources

Tracking #: 3790-5004

Authors: Dylan Van Assche, Julian Rojas, Ben De Meester, Pieter Colpaert

Responsible editor: Cogan Shimizu

Submission type: Full Paper
Abstract:
Sharing real-world datasets that are subject to continuous change (creates, updates, and deletes) poses challenges to data consumers, e.g., reconciling historical versions or handling the frequency of changes. This is evident for Knowledge Graphs (KGs) that are materialized from such datasets, where keeping the graph synchronized with the original datasets is only achieved by frequently and fully regenerating the KG. However, this approach is time-consuming, loses history, and wastes computing resources by reprocessing unchanged data. In this paper, we present a KG generation approach that efficiently handles evolving data sources with different data change signaling strategies. We investigate the change signaling strategies observed in real-world data sources, propose corresponding algorithms to detect data changes, and introduce a declarative approach that relies on RML and FnO to materialize and characterize data changes for evolving KGs. Our approach can optionally and automatically publish detected data changes in the form of a Linked Data Event Stream (LDES), relying on the W3C Activity Streams 2.0 vocabulary to describe changes semantically; this way, changes can be communicated to consumers over the Web. We implement our approach in the RMLMapper as IncRML (Incremental RML). We evaluate our approach functionally on a set of test cases, and quantitatively using a modified version of the GTFS Madrid Benchmark (which takes change into account) and various real-world data sources (bike-sharing, transport timetables, weather, and geographical data). On average, our approach reduces the storage and computing resources needed to generate and store multiple versions of a KG (up to 315.83x less storage, 4.59x less CPU time, and 1.51x less memory), reducing KG construction time by up to 4.41x. The performance gains are more evident for larger datasets, while for smaller datasets our approach's overhead partially nullifies them. Our approach reduces the overall cost of publishing and maintaining KGs, which may contribute to the uptake of semantic technologies. Moreover, using LDES as a Web-native publishing protocol means that not only the KG publisher benefits from the concise and timely communication of changes, but also third parties, if the publisher chooses to make the stream public. In the future, we plan to explore end-to-end performance, possible optimizations of our change detection algorithms, and the use of windows to expand our approach to streaming data sources.
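To make the LDES-based change publication concrete, a single detected change might be described on the event stream as follows. This is a hypothetical Turtle sketch based only on the vocabularies named in the abstract (LDES, with its underlying TREE vocabulary, and Activity Streams 2.0); the ex: IRIs and the bike-station example are illustrative, not taken from the paper:

    @prefix as:   <https://www.w3.org/ns/activitystreams#> .
    @prefix ldes: <https://w3id.org/ldes#> .
    @prefix tree: <https://w3id.org/tree#> .
    @prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .
    @prefix ex:   <http://example.org/> .

    # The change log, published over the Web as a Linked Data Event Stream.
    ex:stream a ldes:EventStream ;
        tree:member ex:change-1 .

    # One member: an update to a bike-sharing station, typed with AS2.
    ex:change-1 a as:Update ;                 # as:Create / as:Update / as:Delete
        as:object ex:station-42 ;             # the entity whose data changed
        as:published "2024-05-01T10:00:00Z"^^xsd:dateTime .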
Tags: Reviewed

Decision/Status: Minor Revision

Solicited Reviews:
Review #1
By Umutcan Serles submitted on 03/Mar/2025
Suggestion: Minor Revision
Review Comment:

The authors seem to have addressed my point about the discussion section adequately. However, I still could not find an answer to my question regarding how to represent complex property paths for "watched". Nevertheless, I believe this is a minor thing that can be handled before the final publication stage.

Review #2
Anonymous submitted on 06/Mar/2025
Suggestion: Minor Revision
Review Comment:

General opinion: as "reviewer #2" of the first version of the paper, I consider the proposed IncRML approach promising and well-structured, and I find that in this new version a large majority of the improvement points I mentioned have been positively addressed. I therefore recommend this paper for publication, while suggesting a few more points for improvement to the authors (see below), which mainly consist of adding clarifications and details, and enhancing the writing flow.

Questions & remarks:

1. "2.2. Change Data Capture" : I suggest that you end the subsection emphasizing that in this paper you'll be focusing on the snapshot-based and timestampb-based approaches, namely in §4.3.
2. "2.5. Incremental mapping rules execution" and global : you often use the expression "support heterogeneous data," which is a good thing in itself, but - given that the IncRML proposal is very interesting for operational contexts where the knowledge graph would be built on complex datasets - it seems to me that you should clarify that the paper focuses experimentally on mono-source heterogeneous data and suggest (e.g., in the conclusion) ways to integrate IncRML in multi-source contexts. This kind of clarification would facilitate, as mentioned in my previous review, the adoption of the IncRML proposal by teams of system architects / system integrators by providing them with a clear vision of the proposal's scope. Paper [1] referred to such multi-source situations and proposed a posteriori reconciliation of data via an orchestration system; perhaps you could draw inspiration from such references to position IncRML.
3. "4.1. IncRML" and "Fig. 3" : I understand that you focus on CDC management and the production of a change log since it is the core of your research and proposal, but it seems to me that you should assist the reader just a little bit more by presenting the entire processing chain, i.e., by highlighting the LDES client and the triplestore. Since Fig. 3 is already quite dense, a written formulation such as "dynamic dataset --[FNO-based CDC]--> extracted changes --[RML-based change signaling]--> LDES artifacts --[LDES-based materialization]--> Triplestore" positioned right at the beginning of §4 would easily resolve the issue.
4. "5.2. KG generation with RML and CDC FnO functions": The phrase "A separate Triples Map is optionally used to [...]" is suggestive, while you make ample use of the proposed technique in the following sections; it seems useful to rephrase it with a wording such as "can be used, as exemplified later in Listing 5 for [...]"
5. "5.2. KG generation with RML and CDC FnO functions": The argument "but other ontologies can be used in the mappings as well" is open-ended; could you provide references for examples?
6. Listings 6 and 7: It could be useful to remind the reader that these artifacts are produced at each IncRML run.
7. "5.5. GTFS Madrid Benchmark extension": This development / section is a consequence of the evaluation step, hence it should'nt be part of §5 but of §6.
8. "6.1.4. Ingestion": Although I know what Virtuoso is, I suggest that you add a reference to the tool in the sentence "We perform this measurement over a Virtuoso instance" for readers who are not familiar with it. Additionally, please provide the Virtuoso release version for the sake of reproducibility of the experimental setup.
9. "6.1.4. Ingestion": I strongly regret that the tool https://github.com/julianrojas87/incrml2sparql, which is a key element of the proposal and experimentation, is not better explained or highlighted. A process diagram describing the principle of the implemented algorithm is at a minimum desirable, both for the clarity of the discussion (the algorithm and the tool should have been presented as early as section 5) and because there is no guarantee of the longevity of this code repository.
10. "7.4. Ingestion": It could be useful to add details on how you overcame the query length limit (i.e., the --limit parameter in incrml2sparql and what the optimal value you used was). Also, what about using a dataset chunking strategy?
11. "8. Conclusion": It could be useful to add references or a glimpse of how-to for some of your future work claims in order to provide a sustainable direction, particularly in the sentences "as incremental generation has only been tackled recently" and "handling schema-level (ontology and mappings) changes efficiently is an important aspect to be considered."

Minor remarks:
- Typo in "6.2. Evaluation setup": "to use theLDES Logical Target" => add a space before LDES.
- Missing caption in "Table 6" => add "Lower is better".
- Typo in "Table 12": "Total execution time (s" => add a closing parenthesis.

Refs:
- [1] Tailhardat et al. (2023), "Designing NORIA: a Knowledge Graph-based Platform for Anomaly Detection and Incident Management in ICT Systems", https://ceur-ws.org/Vol-3471/paper3.pdf

Review #3
Anonymous submitted on 27/Apr/2025
Suggestion: Minor Revision
Review Comment:

The resource under review, Incremental Knowledge Graph Construction from Heterogeneous Data Sources, is the second iteration of a research work that describes an incremental approach towards the construction of KG resources that enables the detection, versioning and RDF materialisation of datasets that change over time.
Following on from my review of the previous version of this paper (review #4), I will comment on how the points raised there have been addressed by the authors.

In the previous review, a first issue was highlighted with the evaluation strategy: the two strategies being compared produced different objects (one a materialised KG, the other the triples of the modified members) and as such could not be meaningfully compared. The authors have taken this point on board and expanded their evaluation to include an interpretation step, using SPARQL UPDATE queries, at the end of their CHANGE strategy that now yields a materialised graph.

I am, however, not really convinced by the approach they have taken, and I think there might have been some confusion, which I will try to clarify.
The point of highlighting that the pipelines produced different objects, and thus that there was no like-for-like comparison, was that the results produced (especially the newly introduced ratios) could be called into question. But I have noticed that the same tables and figures (Tables 6-11 and Figures 4-6) from the first version were simply reproduced, now under the header of a "functional evaluation".
What I was expecting, though, was to see exactly what the authors have done for the "functional evaluation", with the same kind of tables and figures, but with the numbers measured using the CHANGE strategy that includes the "interpretation step" they use in Section 7.4.

In this last section (7.4), for example, the authors begin by saying that "we also put this into practice by ingesting the real-world datasets, via SPARQL UPDATE queries, into a Virtuoso triplestore" (p. 28), but this should actually have been the evaluation itself, not merely "putting something into practice". Following from this, I would also ask the authors whether performing the reconciliation step via the Virtuoso triplestore was a choice or a necessity, especially given the reported limits that this appears to introduce and how, in my view, these contradict the authors' claims about the "real-world applicability, usage and impact" (p. 29) of their CHANGE strategy.
Can the two graphs (from the two strategies) not be materialised offline, and the same metrics as in the functional evaluation (i.e., execution time, peak memory, initial storage, etc.) be used to produce updated figures and tables?

The way the authors have addressed the other points raised in the first review is satisfactory.