Abstract:
Sharing real-world datasets that are subject to continuous change (creates, updates, and deletes) poses challenges to data consumers, e.g., reconciling historical versions and handling change frequency. This is evident for Knowledge Graphs (KGs) that are materialized from such datasets, where keeping the graph synchronized with the original datasets is achieved only by frequently and fully regenerating the KG. However, this approach is time-consuming, loses history, and wastes computing resources by unnecessarily reprocessing data. In this paper, we present a KG generation approach that efficiently handles evolving data sources with different data change signaling strategies. We investigate change signaling strategies observed in real-world data sources, propose corresponding algorithms to detect data changes, and introduce a declarative approach that relies on RML and FnO to materialize and characterize data changes for evolving KGs. Our approach can optionally and automatically publish detected data changes as a Linked Data Event Stream (LDES), relying on the W3C Activity Streams 2.0 vocabulary to describe changes semantically. This way, changes can be communicated to consumers over the Web. We implement our approach in the RMLMapper as IncRML (Incremental RML). We evaluate our approach functionally on a set of test cases, and quantitatively using a modified version of the GTFS Madrid Benchmark (taking change into account) and various real-world data sources (bike-sharing, transport timetables, weather, and geographical). On average, our approach reduces the storage and computing resources needed to generate and store multiple versions of a KG (up to 315.83x less storage, 4.59x less CPU time, and 1.51x less memory), and reduces KG construction time by up to 4.41x.
The performance gains are more evident for larger datasets, while for smaller datasets our approach’s overhead partially offsets them. Our approach reduces the overall cost of publishing and maintaining KGs, which may contribute to the uptake of semantic technologies. Moreover, using LDES as a Web-native publishing protocol means that not only the KG publisher benefits from the concise and timely communication of changes, but also third parties, if the publisher chooses to make the LDES public. In the future, we plan to explore end-to-end performance, possible optimizations of our change detection algorithms, and the use of windows to extend our approach to streaming data sources.