IncRML: Incremental Knowledge Graph Construction from Heterogeneous Data Sources

Tracking #: 3674-4888

Authors: 
Dylan Van Assche
Julian Rojas
Ben De Meester
Pieter Colpaert

Responsible editor: 
Cogan Shimizu

Submission type: 
Full Paper
Abstract: 
Sharing real-world datasets that are subject to continuous change (e.g., additions, modifications, and deletions) poses challenges to data consumers, e.g., historical versioning and change frequency. This is evident for Knowledge Graphs (KGs) that are materialized from such datasets, where keeping the graph synchronized with the original datasets is only achieved by frequently and fully regenerating the KG. However, this approach is time-consuming, loses history, and wastes computing resources by reprocessing unchanged data. In this paper, we present a KG generation approach that efficiently handles changing data sources and their different data change advertisement strategies. We investigate different change advertisement strategies of data sources, propose algorithms to detect implicitly advertised changes, and introduce our approach IncRML (Incremental RDF Mapping Language). IncRML combines RML and FnO functions to detect and classify changes in data sources across an RML mapping process. We also allow these changes to be optionally and automatically published in the form of a Linked Data Event Stream (LDES), relying on the W3C Activity Streams 2.0 vocabulary to describe changes semantically. This way, the changes are also communicated to consumers. We present our approach and evaluate it qualitatively on a set of test cases, and quantitatively using a modified version of the GTFS Madrid Benchmark (taking change into account) and various real-world data sources (bike-sharing, transport timetables, weather, and geographical). On average, our approach reduces the storage and computing resources needed for generating and storing multiple versions of a KG (up to 315.83x less storage, 4.59x less CPU time, and 1.51x less memory), reducing KG construction time by up to 4.41x. The performance gains are more pronounced for larger datasets, while for smaller datasets our approach's overhead partially nullifies them. Our approach reduces the overall cost of publishing and maintaining KGs, which may contribute to the uptake of semantic technologies. Moreover, the use of LDES as a Web-native publishing protocol means that not only the KG publisher benefits from the concise and timely communication of changes, but also third parties, if the publisher chooses to make it public. In the future, we plan to explore end-to-end performance, possible optimizations of our change detection algorithms, and the use of windows to expand our approach to streaming data sources.
Tags: 
Reviewed

Decision/Status: 
Major Revision

Solicited Reviews:
Review #1
By Samaneh Jozashoori submitted on 16/May/2024
Suggestion:
Major Revision
Review Comment:

This paper presents an approach to materialize knowledge graphs incrementally by capturing data changes and providing them in the form of a Linked Data Event Stream (LDES).

General comments:
The paper targets a very practical topic and provides a valuable solution. Overall, the paper is well written, with descriptions and explanations of most concepts at an appropriate level of detail. Despite the authors' obvious effort to make the manuscript self-contained, there is still room for improvement; e.g., definitions of the following basic or frequently used concepts would further help readers get on board: 1. Incremental Knowledge Graph Construction/Creation; 2. Change Advertisement Mechanism and its different strategies.

Summary of the evaluation:
- originality: while I can certainly see it as a novel system paper, the novelty of the approach as a full research paper is not well established (refer to my detailed comment below regarding the related work)
- significance of the results: the approach provides high value under certain conditions, which need to be more transparently communicated in the paper (refer to my detailed comment below regarding Section 7)
- quality of writing: overall very good. The cases where paragraphs or figures could use improvement and where ambiguity needs to be removed are mentioned in the detailed comments below.
- accessibility: the link to the approach/tool is surprisingly missing! Although the links to the experiments are provided, the approach/tool itself is not accessible.

Detailed comments:
- title: IncRML -> why "Incremental RDF Mapping Language"? Isn't the approach more about incremental RDF generation? I think the name of the approach is misleading.
- keywords (page 1 line 38): I don’t think the term “Incremental” alone provides sufficient information to be used as a standalone keyword. If needed to be emphasized, I recommend referring to the concept as “Incremental Knowledge Graph Creation/Construction”
- Introduction (page 2, line 24): a statement such as “We aligned our approach with RML and FnO to use best practices of KG generation” is inaccurate/incomplete and conveys a claim which is not intended to be discussed or proved in this paper. Therefore, I strongly encourage it to be avoided.
- Related work (page 5, line 3): "... existing work on incremental KG generation is either not implemented [51] or ...." Does this sentence mean that the novelty of your approach compared to this particular existing one lies only in its implementation (I'd rather use the term development, though)? (a) If that's the case, it should be transparently communicated that the contribution aims to be a system/resource paper rather than research. (b) If not, you need to provide a better reason than "implementation" to support the novelty of your approach in comparison to the existing related works.
- section 4.3 (page 9, figure 2): Figure 2 is hard to understand without reading the caption or the following explanation. While the caption and the following paragraph provide clear explanations, the texts in the flowchart could be improved accordingly, e.g., by adding a beginning step to the algorithm that shows where the members come from (as explained, they are the IRIs of the members in the "new" data collection).
- section 5.1 (page 11, line 24): what does the “state” of the property exactly represent? What information does it provide and what values can it have?
- section 6, evaluation: in the abstract (page 1, line 27), a qualitative evaluation (on a set of test cases other than GTFS) is mentioned among the contributions of the paper; however, in Section 6 there is no qualitative evaluation subsection and only a quantitative evaluation exists. Is this section missing? If yes, please remove it from the contributions; if no, please add it to the title of the corresponding subsection.
- section 6.1.1: what is this subsection supposed to deliver? Is this supposed to be the only evaluation of the completeness of the results of this approach? As a reader (who has been waiting to read some qualitative evaluation of the approach since the abstract, where it is promised), I receive no precise explanation of the result of this subsection, only "We validate that all combinations of these dimensions are possible and use these test cases to verify if our implementation covers all possible scenarios". Besides, Table 4 is not informative either. This section would benefit from some improvements.
- section 6.2 (page 19, line 36): does the measured execution time also include the time required for Change Data Capture? If yes, please mention it in this section; if no, the experiments in this section need to be reconsidered to include the time required for detecting the changes.
- section 7 (page 23, line 36): Figure 5 shows the results for data size scale 100 only; why? As also mentioned at the beginning of the paper, the results on small-scale data are not as promising as on the large scale, so I wonder why this isn't reflected in Section 7. It is essential for any optimization methodology to transparently acknowledge such variations, whether as limitations or as conditions that impact the desired outcome. This transparency not only maintains the significance of the approach but also enhances the credibility of the scientific publication. Therefore, it would be valuable to repeat the illustrated chart visualization for data size scales 1 and 10. (Also note that, even if the results at scales 1 and 10 do not show promising numbers, as long as the observed values for the different cases follow the same patterns as at scale 100, they support the final conclusion of the results.)

An important open question:
- how do you ensure the completeness of the changes detected and incorporated into the KG? I can't find any theoretical proof, nor any strong experimental one.

Small formatting comments:
- page 7, line 37: “… created, updated, and deleted...”
- page 10, line 3: “... If not, a created member is found…”
- page 10, line 18: “.... versions (Section 4.3);…”

Review #2
Anonymous submitted on 01/Jul/2024
Suggestion:
Minor Revision
Review Comment:

In this article, the authors present the IncRML solution: an algorithm and implementation that apply the principles of Change Data Capture to adjust the operation of a knowledge graph construction process based on RML. The approach helps reduce the processing and storage burden of data inherent in the process by only signaling the necessary modifications to be made to the knowledge graph through a data stream encoded using the LDES vocabulary.

Overall assessment

1. Strengths:

- The IncRML proposal addresses an important challenge in the KGC community, which has not been solved in a generic way so far: managing the update of an existing knowledge graph by focusing on targeted modifications (addition, deletion, modification of entities) while using a declarative processing chain (i.e., RML) and semantic web standards (RML, FnO, LDES, SPARQL, etc.).
- Providing an implementation allows the community to validate the proposal and potentially contribute to its improvement.
- The evaluation of the proposal was conducted on test data and real datasets, which provides a good idea of the solution's applicability, particularly through the analysis of performance factors (e.g., data refresh rate, nature of modifications).
- Although the solution is not currently ready for production use out of the box (e.g., mechanisms for orchestrating IncRML need further discussion, resulting LDES data needs to be streamed and managed by graph stores), the explanations provided give a clear idea of the remaining developments to be done within and around IncRML.
- The authors effectively present the technical foundations used in the subsequent sections through the Related Work and Background sections.

2. Weaknesses:

- Despite the mentioned strengths of the proposal, the paper suffers from its form (e.g., burying key explanations in long paragraphs, redundant explanations, inconsistent notations), making the overall reading laborious and potentially limiting the adoption of the approach by individuals with a development and system architecture background. In my opinion, this is the essential adjustment that needs to be made to the article, which can be easily achieved with some courage from the authors to restructure and make necessary cuts in the text.
- Partly related to the previous point on formulation issues, the discussion of the triggering conditions for IncRML seems misplaced, raising doubts about the fundamental change detection mechanism in the proposed approach. The use of the term "advertisement" appears to be part of the problem (being more specific with "signaling" or "detection" could help). The tables and figures (especially Figure 3, which lacks details on the technical ecosystem, and Table 3, which is disconnected from the rest of the explanations) could be improved and integrated, which would easily resolve this issue.
- In subsection 2.1 of the Related Work section, the authors only provide a list of tool names without indicating their function, strengths, or weaknesses. Beyond the fact that a related work section should provide a minimum level of critical analysis of the mentioned items, it gives the impression that the authors are merely justifying the use of RML without further elaboration.

Questions & remarks:

* Could you elaborate more on how IncRML behaves when we don't have the first data frames of a given dataset (i.e., before any modification)?
* Did you look at how IncRML behaves if the
* Could you elaborate on why you emphasize (many times) the fact that RMLMapper has "100% coverage for all the test cases"? It seems that you want to say that IncRML is operational with RMLMapper (and thereby useful to practitioners) without really saying it.
* §4 starts with an approach overview, then goes into categorizing advertisement strategies and the principles of implicit/explicit change detection in two separate subsections. This seems a very natural way of breaking down the challenges IncRML tries to address; however, it looks from §4.3 like the core of IncRML stands on the shoulders of implicit changes and the implementation of stateful processing (p. 9, lines 11-27)... Could you make clearer, starting in §4.1 & §4.2, what the fundamental principles of the IncRML framework are and how the different "advertisement strategies" compose? It seems to me that this could be achieved by reorganizing the order of the subsections + reworking Table 1 with details from §4.2 & §4.3 + explaining from the beginning whether the algorithms from Figure 2 run in parallel and whether they have branching / common parts.
* It seems that Figure 3 is too high-level a view of the IncRML proposal to give a strong feeling for the technical solution. For instance, should we run multiple instances of this pseudo-pipeline if we have to deal with multiple data sources? Is the CDC feature part of the framework or could it be an external component? Is the CDC block triggering the KGC step? What kind of component consumes the "RDF Incremental" data? What (kind of) component orchestrates the execution of IncRML? And at what (potential) frequency? Indeed, it could be interesting to bring Figure 3 closer to SysML / UML / BPMN / flow diagram standards. It could also be nice to show where the algorithms from Figure 2 live within such a pipeline.

* Missing info: p. 11, line 17, footnote 10 is given (which is good) without further explanation of what it is about (i.e., context). It could be interesting to give this URL earlier in the article and to say that the IncRML core features are available as open source. Further, as IDLabFunctions.java may evolve over time, it could be interesting to share this resource as a code snippet for a more stable reference from the article.

* Subsection 5.4 (p. 14, line 35): could you elaborate more on the usefulness of generating "3 named graphs"?

* It seems that explaining the IncRML approach through the example of Table 3 could be more impactful with minor modifications to the table. For example, it could be more interesting to have the table as a figure where we would see the initial data (part a) aligned with the changed data (part b) within a flow diagram, highlighting with action tags (e.g., implicitUpdate) the modified age values (46 & 40), the deletion of the row with ID #3, and the addition of the row with ID #4. Figure 3 could serve as a basis for the flow diagram. The LDES structure from Figure 1 could also serve as a basis for explaining how updates materialize, instead of having a very generic Figure 1.

* Subsection title: for §7.4, it seems to me that "Lessons learned" would be a better title than "Discussion", as analyses of the results are already provided in each part of §7.2 and §7.3. In the same line of thought, it could be nice to summarize the capabilities, performance gains, and strengths & weaknesses of the whole evaluation process in a single table.

* Missing info/clarification: in the §8 conclusion you very briefly suggest future work in terms of R&D axes, although you mention limitations of your approach in §4.1 (i.e., "an additional step in the pipeline [...]") and have many lessons learned in §7.4. It could be interesting to elaborate in more detail in §8 on the next steps needed to bring IncRML production-ready and suitable as a component of a KG-based platform.

Minor remarks:

* Listings are difficult to read in a black and white printed version of the article.
* Typo: p. 10, line 3 in "If not, an created member".
* Typo: p. 10, line 18 in "Section 4.3; and"
* Typo: p. 10, line 22, "should change" instead of "have changed"?
* Typo: p. 10, line 23 in "TriplesMap8.)"
* Formatting issue: p. 10, footnote 8, on the "as:" prefix
* Listing 4: it could be interesting to highlight the "idlab-fn:implicitUpdate" term.
* Typo: p. 13, line 2, reference [5] was already defined earlier in the text and, by the way, is repeated again later.
* Listing 5: it would be nice for you to improve the readability of the listing by inserting blank lines between code blocks and more explanations as code comments (e.g. the explanations from the listing's caption).
* Formulation unclear/confusing: p. 16, line 40, in "may achieve, might be".
* Table 4: the table might be unnecessary (e.g., it could be shown only in the code repo) or is, at least, not informative and not well formatted (e.g., "purpose" could be symbols defined in the table's caption, "ID" could hold the availability/advertisement/type in the file name, a \hline could split the table between each availability category).
* Subsection title: for §6.1.3 (and generally) it might be interesting to use the term "dataset" instead of "use case".
* Table 6-8: it could be nice to explain that underlined numbers mean "lowest observed value", and also to remind the reader that, in the present cases, "the lower the better".
* Table 6-8: it could be nice to add columns with an all/change ratio or percentage for quick analysis of the results.
* Formulation issue: p. 20, line 31 (and generally): it is preferable to argue with quantitative values instead of solely using phrases such as "a slight increase" => "a slight increase (xxx% on average)" or "a xxx% increase on average, which we consider minimal with respect to ...". The same applies in §8, line 12, where you write "We show how our approach reduces [...]" instead of a more contrasted formulation like "To achieve this we did ..., which, through an evaluation process based on 5 representative datasets, showed that our approach on average reduces [...] by a factor of xxx% [...]".
* Formulation unclear: p. 21, line 50: it looks like writing "We observe an increase in the amount of [...]" would make the paragraph nicer to read.
* Typo: p. 25, line 50 in "the additional the metadata"
* Typo: p. 26, line 43 in "was deleted", don't you miss a "thus"?
* Formulation unclear/inconsistent: p. 26, lines 45 & 47 + p. 27, line 43, and generally: it could be nice to use the same way of writing performance gains across the whole article (i.e., not "4.42 times faster" in one place and then "factor 3.24" in another, without positive/negative indication).

Review #3
By Umutcan Serles submitted on 10/Jul/2024
Suggestion:
Minor Revision
Review Comment:

This paper proposes an approach for incrementally constructing knowledge graphs from heterogeneous sources. It also proposes an extension to RML, namely a new logical target for Linked Data Event Streams (LDES). The approach is evaluated from various perspectives, such as functionality and resource usage, for different types of resources.

(1) originality

The approach certainly appears original, particularly due to its capability to handle different kinds of data sources and formats, as well as implicit (non-advertised) changes. The related work is comprehensive and covers different aspects. For each covered aspect, the advantages and differences of the approach over the literature are argued. Therefore I would say the paper has no major issues with regard to originality.

(2) significance of the results, and

The most important result of the paper is an approach to detect changes for create, update, and delete operations, across different kinds of communication strategies and different levels of availability of history. This approach is implemented as function maps in RML mappings, which allows knowledge graph construction pipelines to detect changes before materializing the entire knowledge graph when the circumstances allow. In many cases, such an approach provides a significant improvement in resource usage, as also shown in the evaluation. To improve the presentation of the results, I have some minor change recommendations:
- The most challenging change detection is implicit updates, as it requires identifying certain properties to watch for changes in values. The example regarding this is a bit too simplistic (PS: I am also a big fan of Person of Interest). To me it is not really clear whether the current syntax for specifying watched properties can accommodate nested objects. Are they handled via property paths or via some other mechanism? (A sketch of my reading of the syntax follows this list.)
- The discussion reads more as a duplication of the evaluation section. I think it would be much better to actually discuss the results and provide insights, particularly about the situations that make it reasonable to apply such an incremental approach. For example, the memory usage seems not to be improved at all with this approach, so what kind of real-life use cases would not be worth adopting such an approach for? What kind of use cases benefit from such an approach the most? Such a discussion would be much more beneficial than repeating the evaluation section.
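For illustration, here is a minimal sketch of my reading of a flat watched-property mapping. The function name idlab-fn:implicitUpdate appears in the paper's listings; the parameter names idlab-fn:iri, idlab-fn:watchedProperty, and idlab-fn:state, as well as the key=value template, are my assumptions about the syntax; and it is exactly such a flat template that seems unable to express nested objects:

    @prefix rr:       <http://www.w3.org/ns/r2rml#> .
    @prefix rml:      <http://semweb.mmlab.be/ns/rml#> .
    @prefix fnml:     <http://semweb.mmlab.be/ns/fnml#> .
    @prefix fno:      <https://w3id.org/function/ontology#> .
    @prefix idlab-fn: <http://example.com/idlab/function/> .

    # The subject IRI is produced by a function that only yields a value
    # when one of the watched properties changed since the previous run.
    <#PersonMap> a rr:TriplesMap ;
      rml:logicalSource <#PersonSource> ;
      rr:subjectMap [
        fnml:functionValue [
          rr:predicateObjectMap
            [ rr:predicate fno:executes ;
              rr:objectMap [ rr:constant idlab-fn:implicitUpdate ] ] ,
            [ rr:predicate idlab-fn:iri ;             # assumed: entity IRI template
              rr:objectMap [ rr:template "http://example.org/person/{id}" ] ] ,
            [ rr:predicate idlab-fn:watchedProperty ; # assumed: flat key=value list
              rr:objectMap [ rr:template "age={age}" ] ] ,
            [ rr:predicate idlab-fn:state ;           # assumed: state file with previous values
              rr:objectMap [ rr:constant "/tmp/person_state" ] ]
        ]
      ] .

How a nested value (e.g., a city inside an address object in a JSON source) would fit into such a flat key=value template is precisely my question above.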

(3) quality of writing.
The writing has overall a good quality. There are a few typos and presentation issues that can be addressed:
- p 4 l 33: tight -> tied
- p 6 l 29: a RDF -> an RDF
- p 10 l 3: "an created member"
- p 7 l 25: Section -> section (there are a few like this; there is no reason to capitalize the first letter of "section" if it is not followed by a section number)
- fig 3: No explanation what different colors mean.
- p 11 l 17: gap between 4.3 and the footnote index.
- Listing 5: EvenStream -> EventStream

Overall I find the paper valuable. The approach appears to be technically solid and extensively evaluated. I do not see any reason to reject the paper, provided the minor revisions are applied.

Review #4
Anonymous submitted on 16/Jul/2024
Suggestion:
Major Revision
Review Comment:

The resource under review, "IncRML: Incremental Knowledge Graph Construction from Heterogeneous Data Sources", describes an incremental approach towards the construction of KG resources that enables the detection, versioning, and RDF materialisation of datasets that change over time. To do this, the authors not only describe their approach but also investigate change detection strategies in datasets, provide algorithms for the detection of said changes (either explicit or implicit), and provide an implementation of their research. Their approach is then contrasted with re-constructing a KG in its entirety whenever a change in the data sources has been made, through a benchmark and use-case based experimental evaluation.

Originality

In terms of originality, the novelties that the article puts forward are related to the incremental approach it proposes for building KGs over dynamic sources, the design of algorithms for detecting both explicit and implicit changes made to the underlying source data inputs, and the shifting of the burden of versioning from the producers of data to the consumers of data. The authors furthermore claim to provide a classification of change advertisement strategies, although it is unclear in the paper whether this is taken to be an original contribution (see point 3 under Quality of writing below).

Significance of the results

The authors have produced a benchmark and a copious amount of use cases and empirical evidence to support the evaluation of their approach and associated implementation. For this they should be commended. There is, however, an issue with the evaluation methodology that requires addressing. It follows from the authors' choice of shifting the burden of versioning to the consumer of the data, which affects the chosen strategy comparison.

In Section 6.2 the authors write that they are going to compare two strategies: the strategy that materialises the KG every time an update has been made, referred to as "ALL", and the authors' own incremental strategy, referred to as "CHANGE". The problem lies in this sentence, where for the CHANGE strategy "we initially materialize a complete version of the KG, but only materialize into RDF the actual changed members upon updates of the data collection, not the complete KG" (pp. 19). It appears to me that the two strategies under consideration produce different objects, one a materialised KG, the other the triples of the modified members, and as such they cannot be meaningfully compared. Within this experimental scenario, the reconciliation step that a consumer of the data might want to make at the end of the CHANGE strategy must also be performed. Two KGs would then be the outputs, and these could be compared quantitatively using the metrics and the use cases presented.

A second issue with the results presented revolves around the example provided in Section 5.4, Listings 6 and 7, which for me creates more questions than it answers. In Listing 6, which represents the initial KG materialisation of the example data, there are syntactic mistakes: the predicate "a" is missing and some inverted commas are also missing for the string literals. This, though, could simply be a typesetting error.

In listing 7 instead, which represents the materialised changed members, only the triple

<…> a foaf:Person .

has been deleted, while the associated triples

foaf:name "Agent Carter" . foaf:age 36"^^xsd:int .

of the first graph have not been included in the deleted elements. My understanding is that this follows from the "tombstone" approach taken for deletion changes (page 10, Section 4.4), but I am not convinced of its viability. First of all, the URIs appear to contain versioning information within them, in the last digit after the hashtag, and therefore they do not match across the two listings. How is this going to impact reconciliation? Is the consumer expected to modify all the URIs in order to reconcile the graphs? Second, how are incoming links to deleted entities accounted for by the implementation? I think this is worth considering: while the incremental approach might be better computationally in terms of time, memory usage, etc., if it makes life more difficult for the consumers performing the local reconciliation, that can negatively affect the impact of this research. A useful element here would be to specify well the problem domain within which this approach may prove to be more fruitful for data consumers.
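To make this concrete, my reading of the listings is that a deletion tombstone looks roughly as follows (the IRIs are illustrative, not copied from the paper; as:Delete comes from the W3C Activity Streams 2.0 vocabulary the authors rely on):

    @prefix as:   <https://www.w3.org/ns/activitystreams#> .
    @prefix foaf: <http://xmlns.com/foaf/0.1/> .

    # Illustrative tombstone: a new versioned IRI re-states only the type and
    # marks the member as deleted; the entity's other triples (name, age) never appear.
    <http://example.org/person/3#2> a foaf:Person, as:Delete .

A consumer would then have to strip the version fragment to recover the entity IRI, drop all local triples having it as subject, and separately decide what to do with incoming links; this is exactly the reconciliation burden questioned above.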

Finally, if we take a look at the modified triples for "Harold Finch", we see that although only the age attribute has been modified, all the triples corresponding to him have been added to the changed data collection, with different URIs, even though two of the three triples have not actually been modified and are the same as in Listing 6. The same considerations made in the paragraph above apply here.

Quality of writing

The article is generally well written. There are however a few exceptions that the authors should look into.

Very minor, but on page 2 the authors describe their approach as “holistic” as “it is designed to handle data changes from data sources which advertise their changes [...] but also from data sources which do not advertise their changes [...]”. I do not believe that “holistic” is actually the right adjective here, as it usually refers to a “whole” that is not reducible to its parts or to the explanation of the parts through their role within the whole of which they are parts. Maybe “flexible” or “general” might be more appropriate here.
There are a few sentences that may require either a reference or rephrasing as an opinion or assumption that the authors are making. Examples include, on page 2, the sentence starting with "This approach is unsustainable [...]" and the sentence starting with "However, current approaches for integrating heterogeneous data [...]".
On page 2, when describing the contributions made by this paper, the authors state "(i) an overview of different data collection categories regarding history and change advertisement". On page 8, Section 4.2 introduces "several change advertisement strategies can be identified if we analyze data collections", which I understand to be the section corresponding to contribution (i) just mentioned. But the data collections being talked about have not been specified in the paper, nor has the way the authors reached this classification. Has a survey been conducted in preparation for this work, or has a literature review been consulted? Are these observations from experience that are being turned into a postulate for the algorithms? I believe more detail is necessary here in order to be able to evaluate this specific contribution.
The paper requires further copyediting. There is some punctuation missing here and there and some typos; for example, on page 21 we read "Increasing the amount of changes increases the storage usage with our approach can achieve". I think it should be "that" instead of "with". Also, there is a bit of repetition in the paper that I think can be ironed out.

Resources

The authors provide a stable URL link to a Zenodo page containing the materials used for the experiments and evaluation. This page has a README file that explains how to access the data and reproduce the experiments. The provided data artifacts appear to be complete.