Optimizing SPARQL Queries over Decentralized Knowledge Graphs

Tracking #: 3312-4526

Christian Aebeloe
Gabriela Montoya
Katja Hose

Responsible editor: 
Aidan Hogan

Submission type: 
Full Paper
While the Web of Data in principle offers access to a wide range of interlinked data, the architecture of the Semantic Web today relies mostly on the data providers to maintain access to their data through SPARQL endpoints. Several studies, however, have shown that such endpoints often experience downtime, meaning that the data they maintain becomes inaccessible. While decentralized systems based on Peer-to-Peer (P2P) technology have previously shown to increase the availability of knowledge graphs, even when a large proportion of the nodes fail, processing queries in such a setup can be an expensive task since data necessary to answer a single query might be distributed over multiple nodes. In this paper, we therefore propose an approach to optimizing SPARQL queries over decentralized knowledge graphs, called Lothrok. While there are potentially many aspects to consider when optimizing such queries, we focus on three aspects: cardinality estimation, locality awareness, and data fragmentation. We empirically show that Lothbrok is able to achieve significantly faster query processing performance compared to the state of the art when processing challenging queries as well as when the network is under high load.
Full PDF Version: 

Minor Revision

Solicited Reviews:
Click to Expand/Collapse
Review #1
By Olaf Hartig submitted on 05/Feb/2023
Review Comment:

I am happy to see that the authors have addressed the minor weaknesses pointed out in my review of the previously submitted version of this manuscript. There are the following two new minor issues in the formalization. However, I am sure that the authors can fix them when preparing the camera-ready version and, thus, I propose to accept this manuscript now.

* The new definition of S_P that was added before Def.2 (in Sec.3) is incorrect. Why don't you simply define S_P as follows?
S_P = { s | (s,p,o) \in P }

* The definition of "the answer to P over G" in the paragraph after Def.3 is still incorrect. For a solution mapping μ to qualify to be in the query result, it must not only satisfy the condition given in the text (μ[P] ⊆ G) but, additionally, its domain must not contain any variables other than the variables in the given BGP; i.e., dom(μ) = vars(P). Without this additional condition, the query result [[P]]_G will be an infinite set.

Review #2
By Antonis Troumpoukis submitted on 13/Feb/2023
Review Comment:

In this paper, the authors propose Lothbrok, a system for processing SPARQL queries on decentralized knowledge graphs over a P2P architecture.

The approach is novel, it fits well on the Semantic Web community, and it is of high impact in the context of decentralized SPARQL processing systems. The literature review covers a wide range of relevant systems and approaches, and the authors describe their methodology in a sufficient and detailed way. Lothbrok is evaluated both in synthetic and real-world benchmarks, and it is compared with the previous P2P approaches of the team Piqnic and ColChain, on which Lothbrok builds upon. The experimental results suggest a significant improvement on query processing in almost every query load evaluated with. The authors provide a long-term stable URL that contains the code and the datasets needed to reproduce the experiments.

All my comments in my previous review have been addressed satisfactorily. I feel, and I hope that the authors agree, that the quality of their article is sufficiently improved.

Review #3
Anonymous submitted on 27/Mar/2023
Minor Revision
Review Comment:

Overall, the authors have provided a satisfactory response to the comments. They have addressed each comment individually and provided clear explanations for their choices. Here are some possible suggestions for further improvement:

Regarding the first comment about the definition of knowledge graphs not signifying that the systems work as a normally distributed graph, the response is appropriate. The authors have acknowledged that their work focuses specifically on knowledge graphs and have added a clarification in Section 1 to avoid any misunderstandings. They have also indicated that they will consider general distributed graphs in future work, which shows that they are open to exploring other types of graphs beyond knowledge graphs.

In response to the second comment about the lack of coverage of knowledge graph querying literature, the authors have acknowledged that they purposefully focused their discussion on the most related querying literature, such as the LDF framework, federated query processing, and Peer-to-Peer systems. They have also indicated their willingness to expand the discussion if the reviewer provides more specific details about what aspects are missing.

the authors have added overview figures to two sections in the paper, as well as expanding Table 3 and updating Section 4.2 and adding a paragraph to the end of Section 4.3 to incorporate the discussion on the impact of graph complexity on the system. They have also explained why they did not aim to beat centralized systems in terms of performance but rather focused on making query processing in the decentralized setup feasible with high query scalability. They have simplified some sentences and clarified others to make the paper easier to follow.

They explained that there may be non-relevant data obtained through the query fragment, but they try to avoid it by pruning non-relevant fragments in the source selection step. They differentiate the compatibility graph from the fragmented KG by saying that the compatibility graphs capture which pairs of fragments are compatible for a given query. The authors have simplified the definition of a compatibility graph to avoid confusion. The authors have updated the introduction to Section 4.3 to more clearly motivate SPBF indexes. They have added a clear example to the beginning of Section 5.1 with a visual element in Figure 7 that shows exactly what compatibility means. They have added Definition 14, which defines the relevantFragment(P,f) and updated the text throughout Section 5 with paragraphs stating what relevance means in this context. The relevance of fragments is a binary value; yes or no, rather than computing relevance scores. The authors also explained that the execution plans in Table 2 contain all the relevant fragments found in the query.

Page 6 Definition 3
The definition appears to be correct and well-defined. However, there are a few potential issues that could arise in practice and it would be great if authors can explain this:

The size of the set of solution mappings [[P]]G can be very large, making it computationally expensive to enumerate all possible mappings. In practice, some optimization techniques may be needed to efficiently compute the set of solutions.

The definition assumes that the underlying knowledge graph G is static, and does not change over time. However, many real-world knowledge graphs are dynamic, meaning that new triples can be added or existing triples can be deleted or modified. The definition does not specify how to handle dynamic changes to the knowledge graph, which could affect the correctness and completeness of the results.

The definition does not specify how to handle inconsistencies or contradictions in the knowledge graph. In practice, it is common for knowledge graphs to contain conflicting or inconsistent information, which could lead to unexpected or erroneous results. Additional techniques may be needed to handle such cases.

Page 12: Line 20-33
The statement suggests that merging small fragments can improve lookup time for optimizing join order and estimating cardinalities, but it is not clear how significant this improvement is or if there are any trade-offs (such as increased memory usage for larger fragments).

Page 13: Line 19-34
the complexity and potential performance impact of the proposed Semantically Partitioned Bloom Filters (SPBFs) indexing schema. This may need to be evaluated further in real-world scenarios to ensure that it is practical and efficient. Additionally, the fact that the SPBF indexes have to match entire star patterns to fragments rather than triple patterns, as opposed to the PPBF indexes from [19], may require some adjustment in the query optimization process.
Page 16 Line 1-20
the authors assume that query execution plans are always left-deep, which may limit the potential for optimization in certain cases. This is a limitation that could be addressed in future work as well.