IncRML: Incremental Knowledge Graph Construction from Heterogeneous Data Sources

Tracking #: 3674-4888

This paper is currently under review
Dylan Van Assche
Julian Rojas
Ben De Meester
Pieter Colpaert

Responsible editor: Guest Editors KGC 2024

Submission type: Full Paper
Sharing real-world datasets that are subject to continuous change (e.g., additions, modifications, and deletions) poses challenges to data consumers, such as historical versioning and varying change frequency. This is evident for Knowledge Graphs (KGs) that are materialized from such datasets, where keeping the graph synchronized with the original datasets is only achieved by frequently and fully regenerating the KG. However, this approach is time-consuming, loses history, and wastes computing resources by reprocessing data that has not changed. In this paper, we present a KG generation approach that efficiently handles changing data sources and their different change advertisement strategies. We investigate the change advertisement strategies of data sources, propose algorithms to detect implicitly advertised changes, and introduce our approach, IncRML (Incremental RDF Mapping Language). IncRML combines RML and FnO functions to detect and classify changes in data sources across an RML mapping process. Optionally, these changes can be automatically published as a Linked Data Event Stream (LDES), relying on the W3C Activity Streams 2.0 vocabulary to describe changes semantically. This way, the changes are also communicated to consumers. We present our approach and evaluate it qualitatively on a set of test cases, and quantitatively using a modified version of the GTFS Madrid Benchmark (taking change into account) and various real-world data sources (bike-sharing, transport timetables, weather, and geographical data). On average, our approach reduces the storage and computing resources needed for generating and storing multiple versions of a KG (up to 315.83x less storage, 4.59x less CPU time, and 1.51x less memory), reducing KG construction time by up to 4.41x. The performance gains are more pronounced for larger datasets, while for smaller datasets, our approach's overhead partially nullifies these gains.
Our approach reduces the overall cost of publishing and maintaining KGs, which may contribute to the uptake of semantic technologies. Moreover, because LDES is a Web-native publishing protocol, not only the KG publisher but also third parties can benefit from the concise and timely communication of changes, if the publisher chooses to make the stream public. In the future, we plan to explore end-to-end performance, possible optimizations of our change detection algorithms, and the use of windows to extend our approach to streaming data sources.
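To make the idea of detecting implicitly advertised changes more concrete, the following is a minimal Python sketch of one plausible strategy: comparing two snapshots of a data source by record key and content hash. It is not the authors' algorithm and does not use RML or FnO. The function names (`fingerprint`, `classify_changes`), the `id` key, and the bike-sharing sample records are assumptions made for illustration. Only the labels Create, Update, and Delete are taken from the Activity Streams 2.0 vocabulary named in the abstract.

```python
import hashlib
import json


def fingerprint(record):
    """Return a stable hash of a record's content (keys sorted for determinism)."""
    payload = json.dumps(record, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def classify_changes(previous, current, key="id"):
    """Compare two snapshots (lists of dicts) and label each changed record.

    Labels reuse the Activity Streams 2.0 activity type names
    Create, Update and Delete. Unchanged records are omitted.
    The function names and the "id" key are hypothetical.
    """
    old = {r[key]: fingerprint(r) for r in previous}
    new = {r[key]: fingerprint(r) for r in current}
    changes = {}
    for k, h in new.items():
        if k not in old:
            changes[k] = "Create"   # record appeared in the new snapshot
        elif old[k] != h:
            changes[k] = "Update"   # same key, different content
    for k in old:
        if k not in new:
            changes[k] = "Delete"   # record disappeared from the new snapshot
    return changes


# Hypothetical bike-sharing snapshots, for illustration only.
previous = [
    {"id": "s1", "bikes": 4},
    {"id": "s2", "bikes": 0},
    {"id": "s3", "bikes": 7},
]
current = [
    {"id": "s1", "bikes": 4},
    {"id": "s2", "bikes": 2},
    {"id": "s4", "bikes": 1},
]
result = classify_changes(previous, current)
print(result)
```

In this sketch, only the records labelled as changed would need to be mapped again, while unchanged records such as `s1` are skipped, which is the intuition behind avoiding full regeneration of the KG.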
Full PDF Version: Under Review