Algebraic Mapping Operators for Knowledge Graph Generation

Tracking #: 3806-5020

Authors: 
Sitt Min Oo
Ben De Meester
Ruben Taelman
Pieter Colpaert

Responsible editor: 
Oscar Corcho

Submission type: 
Full Paper
Abstract: 
Recent advancements in declarative knowledge graph generation have led to the development of multiple mapping languages, their various versions, and different mapping engines that can interpret these languages and execute the mapping process. The field has progressed to the extent that current studies are now more focused on optimizing the knowledge graph generation process. Although different mapping engines share the common functionality of generating knowledge graphs from heterogeneous data sources, sharing the various optimization techniques and features of these engines remains challenging due to the lack of formal operational semantics for the general mapping processes. A set of algebraic mapping operators can provide the necessary operational semantics for general mapping processes, establish a theoretical foundation for mapping languages, and facilitate the introduction and evaluation of a compliant implementation that is capable of interpreting and executing multiple mapping languages. In this paper, we propose such an algebra based on the SPARQL algebra. This allows us to maximally reuse established definitions, and further bridge the world of knowledge graph generation with query engines. To evaluate that our work is not limited to a single specific mapping language, we translated the mapping languages ShExML and RML to our mapping plan composed of algebraic mapping operators. The results of our completeness evaluation show that our algebraic operators cover the operational semantics of RML fully and those of ShExML partially. To fully cover ShExML, further analysis into ShExML’s concise operational semantics is needed (e.g. for joining data from two input sources). For performance evaluation, our proof-of-concept algebraic mapping engine has a consistent and low memory usage across the different workloads, and achieved second place in the Knowledge Graph Construction Workshop’s performance challenge.
Algebraic mapping operators decouple mapping engines from the mapping languages, enabling multilingual mapping engines. Furthermore, the mapping plan can incorporate optimization techniques as a separate process from the mapping itself, allowing us to benefit from state-of-the-art mapping process optimizations. The proposed set of algebraic mapping operators will lay the foundation for future studies on the theoretical analysis of complexity and expressiveness of mapping languages, and will provide consistency in the execution semantics of mapping engines. Furthermore, the alignment of our algebra with SPARQL will enable further research into advanced methods such as virtualization, enabling heterogeneous data querying.
Tags: 
Reviewed

Decision/Status: 
Accept

Solicited Reviews:
Review #1
By Mario Scrocca submitted on 15/Apr/2025
Suggestion:
Minor Revision
Review Comment:

The authors addressed all the comments from my previous review, and I believe the paper is now easier to follow thanks to the improvements in the definitions and the additional examples.
However, I recommend a revision considering the work recently accepted for publication by the same first author ("An Algebraic Foundation for Knowledge Graph Construction"), which presents another similar but different formalization of the KG construction process.
I believe this paper still represents a valid contribution, considering the implementation and evaluation of the proposed formalisation through the described tools. However, a comparison is necessary to justify the original contributions of this work (e.g., as part of the related work section) and outline the potential future work towards a single formalisation of the KG construction process.

Review #2
By Jose Emilio Labra-Gayo submitted on 27/Apr/2025
Suggestion:
Minor Revision
Review Comment:

Since I already reviewed the previous version, I note here that the authors addressed most of my comments, and I refer to my general assessment from that review.

I think the authors have improved the paper's readability and I have only some minor concerns:

I noticed that the first author of the paper has recently published another paper on arXiv, which I think has been accepted to ESWC: https://arxiv.org/abs/2503.10385. However, that paper is not cited in the current paper. I assume the reason is that the other paper was written after this one. Nevertheless, as some of the contents are related, I think it would make sense to add a paragraph or some comparison between the arXiv version and the current paper.
Definition 1 indicates that a fragment is a submultiset with a reference to 34… I think it would be more readable if it were described as a subset of a multiset, or as some concept more familiar to the reader, so that the reader does not have to consult an external reference on multiset theory.
The example in Listing 1 has the age of Susan Sue with value 23, but in Table 1 the age of Susan Sue is 25. Although I understand that they could be based on different data, I think it would probably be better if that value were the same. I also wonder whether the JSON data could include information about Alice Joe to make it compatible with Table 1.
In Example 2, the projection uses the variables { ?name, ?pet.name and ?pet.type }, which are shown in Table 3 in a different order (?name, ?pet.type and ?pet.name). It may be better to list them in the same order.