Optimizing SPARQL Queries over Decentralized Knowledge Graphs

Tracking #: 3139-4353

Authors: 
Christian Aebeloe
Gabriela Montoya
Katja Hose

Responsible editor: 
Aidan Hogan

Submission type: 
Full Paper
Abstract: 
While the Web of Data in principle offers access to a wide range of interlinked data, the architecture of the Semantic Web today relies mostly on the data providers to maintain access to their data through SPARQL endpoints. Several studies, however, have shown that such endpoints often experience downtime, meaning that the data they maintain becomes inaccessible. While decentralized systems based on Peer-to-Peer (P2P) technology have previously been shown to increase the availability of knowledge graphs, even when a large proportion of the nodes fail, processing queries in such a setup can be an expensive task since data necessary to answer a single query might be distributed over multiple nodes. In this paper, we therefore propose an approach to optimizing SPARQL queries over decentralized knowledge graphs, called Lothbrok. While there are potentially many aspects to consider when optimizing such queries, we focus on three aspects: cardinality estimation, locality awareness, and data fragmentation. We empirically show that Lothbrok is able to achieve significantly faster query processing performance compared to the state of the art when processing challenging queries as well as when the network is under high load.
Tags: 
Reviewed

Decision/Status: 
Major Revision

Solicited Reviews:
Review #1
By Antonis Troumpoukis submitted on 18/Jul/2022
Suggestion:
Major Revision
Review Comment:

In this paper, the authors extend their previous systems Piqnic and ColChain by proposing Lothbrok, a system for processing SPARQL queries on decentralized knowledge graphs over a P2P architecture. The approach is novel (to the best of my knowledge), it fits well within the Semantic Web community, and it is of high impact in the context of decentralized SPARQL processing systems.

The literature review covers a wide range of systems and approaches: not only P2P systems relevant to Lothbrok, but also federated query processors and classical client-server architectures. I do not see any missing works from the literature.

The authors provide several contributions in their article. Lothbrok uses characteristic sets as a fragmentation technique, and it uses a new type of bloom filters (namely Semantically Partitioned Bloom Filters, or SPBF) for indexing. The idea is for each fragment to keep several lists of bit vectors, one list for the subject and one list for each object of the characteristic set; each element of each list corresponds to a specific URI prefix. SPBF indexes then are used 1) to develop a new cardinality estimator for providing more accurate estimations and 2) to develop a new query optimization technique that reduces communication cost by delegating subqueries to several P2P nodes. This optimization also makes use of a new concept of compatibility graph in order to discover non-joinable fragments and produce more efficient query execution plans.
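The SPBF idea summarized above (per-fragment lists of bit vectors, one for the subjects and one per predicate for that predicate's objects, each partitioned by URI prefix) can be sketched in a few lines. The following is only an illustration of the indexing principle, not the authors' implementation: all class and function names (`SPBFIndex`, `prefix_of`, etc.) are hypothetical, and a simple hash-based Bloom filter stands in for whatever filter implementation Lothbrok actually uses.

```python
import hashlib

class BloomFilter:
    """A minimal Bloom filter over strings (m bits, k hash functions)."""
    def __init__(self, m=256, k=3):
        self.m, self.k, self.bits = m, k, 0

    def _positions(self, item):
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:4], "big") % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def __contains__(self, item):
        return all((self.bits >> pos) & 1 for pos in self._positions(item))

def prefix_of(iri, prefixes):
    # Map an IRI to the longest matching prefix (hypothetical partitioning rule).
    matches = [p for p in prefixes if iri.startswith(p)]
    return max(matches, key=len) if matches else None

class SPBFIndex:
    """One slice per fragment: a prefix-partitioned filter list for subjects,
    plus one filter list per predicate for that predicate's objects."""
    def __init__(self, prefixes):
        self.prefixes = prefixes
        self.subjects = {p: BloomFilter() for p in prefixes}
        self.objects = {}  # predicate -> {prefix: BloomFilter}

    def add_triple(self, s, p, o):
        sp = prefix_of(s, self.prefixes)
        if sp:
            self.subjects[sp].add(s)
        op = prefix_of(o, self.prefixes)
        if op:
            slot = self.objects.setdefault(p, {q: BloomFilter() for q in self.prefixes})
            slot[op].add(o)

    def may_contain_subject(self, s):
        # Bloom filters never give false negatives, so False here is definitive.
        sp = prefix_of(s, self.prefixes)
        return sp is not None and s in self.subjects[sp]
```

Because membership checks only touch the single filter for the matching prefix, two slices whose filters share no set bits for a given prefix can be recognized as non-joinable without fetching any data, which is what the cardinality estimation and pruning described above build on.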

Lothbrok is evaluated on both synthetic and real-world benchmarks, and it is compared with the team's previous P2P approaches, Piqnic and ColChain. The experimental results suggest a significant improvement in query processing in almost every query load evaluated. The authors demonstrate scalability using the synthetic benchmark WatDiv, and they show that their system improves query processing time on real-world datasets and queries using LargeRDFBench.

The authors provide a long-term stable URL in Zenodo that contains the datasets needed to reproduce the experiments. The code for Lothbrok is not included in Zenodo, but it is included in the following GitHub repository: https://github.com/dkw-aau/Lothbrok-Java (I found it through the Lothbrok website since it does not appear in the paper). I suggest also adding the GitHub link on the SWJ website; if this is not possible, I suggest including the Lothbrok code in the Zenodo repository.

Regarding the shortcomings of the work, my first concern is the presentation of the approach. I found that some parts of the text are too technical and convoluted, and I believe that there is much room for improvement in the quality of the writing. I will go through the text of all sections later in my review to provide some suggestions for improvement.

The second major issue of the paper is that information about the compatibility graph is missing. The authors mention that their query optimization technique is “based on the compatibility graph” (pg.20 ln.7); however, it is not clear to me how exactly they make use of the compatibility graph (it is never mentioned when the optimization algorithm is discussed in Section 5.3). I suggest that the authors add some more material on this and, if possible, include the exact definition of the form of the DP algorithm of their approach. Moreover, I missed some explanation regarding the correctness of the compatibility graph optimization. To my understanding, the compatibility graph is used to discover non-joinable fragments in order to reduce the set of candidate plans considered by the DP algorithm. However, I missed a discussion of whether each of the plans that the optimization prunes indeed contains non-joinable fragments. In other words, I would have expected the authors to provide a simple proof of their claim that the pruned fragments “do not contribute to the overall query result” (pg.13 ln.20).
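To make the concern concrete, the kind of pruning at issue can be illustrated with a small sketch. This is not the paper's DP algorithm: `prune_plans` and `compatible` are hypothetical names, and checking compatibility only between consecutive star subqueries is a simplification of a full compatibility graph over all joined fragment pairs. The point is merely that fragment combinations flagged as non-joinable are dropped before plan enumeration continues, so the correctness of the overall plan hinges on the compatibility test never flagging a joinable pair.

```python
from itertools import product

def prune_plans(subq_fragments, compatible):
    """subq_fragments: list of candidate-fragment lists, one per star subquery.
    compatible(f, g): True if fragments f and g may share join values
    (e.g., their index slices overlap). Returns only those fragment
    combinations in which every consecutive pair is compatible."""
    plans = [(f,) for f in subq_fragments[0]]
    for candidates in subq_fragments[1:]:
        # Extend each surviving partial plan only with compatible fragments.
        plans = [plan + (f,) for plan, f in product(plans, candidates)
                 if compatible(plan[-1], f)]
    return plans
```

If `compatible` is sound (it only returns False for pairs that provably produce no join results), every pruned combination contributes nothing to the query answer; the proof the review asks for would essentially be a proof of that soundness property.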

Finally, regarding the evaluation of the technique, the authors mention that Lothbrok does not perform well in the watdiv-paths workload, which is expected behavior, since the previous approaches use predicate-based fragmentation. I was wondering whether the authors have an idea of how this problem could be solved; is this maybe a nice topic for future work? It would have been nice to have some relevant discussion about this in Subsection 7.6 or Section 8.

Other comments:

Section 3 covers the essential background and the preliminary definitions for the approach. However, it seems that there is some information appearing in the remainder of the paper that can be moved here (see the relevant discussion on Section 5).

Section 4 describes the fragmentation technique and the SPBF indexes. The first two subsections are fine, but Subsection 4.3 is difficult to follow. The presentation is too technical, and the text that follows Definition 13 until the end of the subsection is too brief and difficult to understand. Moreover, the definitions do not exactly mirror the ones given in Section 3.2. I would suggest polishing the text and adding an example of the use of the definitions of SPBF index and slice and of the definition of relevantFragment to clear things up.

Section 5 presents the query optimization part of Lothbrok. The section describes the optimizer in detail, but, as previously, I believe that there is much room for improvement regarding the presentation of the material. In Subsection 5.1 the authors define the compatibility graph in terms of the SPBF indexes, making the section too technical and difficult to read. I suggest breaking Definition 16 into several smaller definitions to improve the presentation. Regarding Subsections 5.2 and 5.3, it seems that the authors mix background information (e.g., Equation 3), definitions that make up the cardinality estimator (e.g., Equation 5), and examples in the same text. Some equations contain arithmetic calculations that take up unnecessary space. Also, some definitions regarding cardinality estimation are included in Subsection 5.3 (instead of Subsection 5.2). I would suggest moving some background definitions to Section 3 and placing all definitions of the cardinality estimator in one place.

Summary:

(+) the problem fits well in the Semantic Web community
(+) the approach is novel (to the best of my knowledge)
(+) the evaluation results show a significant improvement over the state of the art
(-) the presentation of the work and the quality of writing could be improved
(-) the compatibility graph optimization is not properly documented
(-) the approach does not perform well in “path” queries

Review #2
Anonymous submitted on 03/Aug/2022
Suggestion:
Major Revision
Review Comment:

In the present paper, the authors present an idea for optimizing decentralized systems using a fragment-based approach called “LOTHBROK”. The system shows incremental effects on some aspects, and it has both positive aspects and some limitations.
The positives and critiques of the paper are as follows:
1. The language of the manuscript is comprehensible, the paper is well-structured, and it follows a detailed methodology with moderate validation.
2. The definition of the knowledge graph, as well as its utilization across the manuscript, does not make clear that the system works as a normally distributed graph.
3. The authors have given detailed definitions of all the concepts, a detailed introduction, and mathematical annotations.
4. The related work covered in the paper is satisfactory, except that the authors fail to cover the knowledge graph querying literature; covering it would make this work sounder and more detailed.
5. The work is primarily the continuation of query optimization work over a period of time, and the current system is compared against, and aims to outperform, the two latest works, namely “PIQNIC” and “COLCHAIN”.
6. The fragmented query distributions in the decentralized system on top of decentralized querying make this work novel and interesting.
7. The authors have covered almost all aspects of query engine testing, such as cost distribution, network overload, and query benchmarking; the time and other performance evaluations are satisfactory.
8. Overall, the authors provided extensive definitions of their proposed methodology along with logically defined algorithms and terminology with required examples.

As query optimization has been a well-studied field in the last few years, there are some questions that need to be addressed, along with some minor writing and structural changes in the present form of the manuscript.
1. Given the complexity of the paper, a flow diagram at the beginning, listing an overview of all the steps, is required.
2. The authors have failed to discuss graph complexity in the fragmented approach; in distributed querying systems, the query response and plan are often also affected by graph centrality measures. It is critical to have a discussion of how the fragments are affected by graph complexity. Further, the impact of graph complexity on the index is also missing.
3. The authors need a section documenting their point of view on, and a comparison of their system with, centralized systems: how their querying mechanism is better and makes the decentralized system more effective.
4. Page 5, lines 11-13: “the fragmented datasets a later new indexing scheme …” — the sentence is unclear and not supported in the manuscript with appropriate evidence.
5. Page 7, lines 17-19: the distributed index and distributed slices, and their mapping to the distributed query, are difficult to follow.
6. Page 8, lines 1-8: the line “In other words, [19] represents the combined set of subjects and objects in a fragment in a prefix-partitioned bitvector” is unclear; moreover, it is difficult to follow how the indexing method used in the proposed approach is better than the compared methods.
7. Page 9, lines 10-12: how have the authors defined the relevance of fragments, and, subsequently, the selection of fragments based on relevance? Did the authors use any method to find which fragments are more relevant to the query or the query fragment? One suggestion is to look into the top-k fragments for each KG.
8. Do the queries follow a feedforward system? After the satisfactory completion of each query fragment, is there a way to avoid the non-relevant data obtained through this fragment?
9. Page 10, lines 4-7: how do the authors differentiate the compatibility graph from the fragmented KG? Is it really a KG or just a set of compatibility graphs?
10. Page 11, lines 44-47: “---rather they associate the predicate value with the index slice itself (Definition 9),” is not clear from Definition 9, so it is difficult to follow. Is there any example where “LOTHBROK allows for fragments with several distinct predicates”? If so, how does this affect query performance?
11. Page 13, lines 1-3: repetitive; the authors can summarize this in one line or so (minor).
12. The authors need to show fragment compatibility in the context of a query fragment; it would be great to have a clear example of this, or a set of queries that can explain it.
13. Cardinality estimation of each fragment is valid; however, what is the cardinality index of non-relevant fragments compared to more relevant or totally relevant fragments?
14. Table 2, query execution cost: is it relevant to all fragments or a limited set of fragments? How did the authors decide the relevance of the subquery? Is there a way to compare the query cost of relevant vs. more relevant fragments?
15. Page 26, lines 25-36: many steps that are difficult to follow through text; it would be good to have a flow diagram.
16. The performance under heavy network load can depend on the similarity of fragments, fragment size, and coplanarity of fragments; this workload estimation needs a detailed time-out table if possible, or else a few sentences to explain it should be enough.
17. Did the authors look into path queries, i.e., how their system performs when different members of the path are distributed across various fragments?
18. Overall, an interesting and detailed article on efficient query processing.

Review #3
By Olaf Hartig submitted on 05/Sep/2022
Suggestion:
Minor Revision
Review Comment:

This manuscript presents a next step that improves upon the authors' earlier work on partitioning and querying RDF datasets using peer-to-peer approaches. The contributions of this manuscript are i) a new technique for partitioning a dataset by considering Characteristic Sets, ii) a new indexing scheme that uses separate Bloom filters for subjects and for each predicate (indexing the objects associated with each predicate), and iii) a query optimization strategy that leverages the new indexing scheme. A comprehensive evaluation shows that these improvements outperform the baseline consisting of the authors' earlier systems (which have been the state of the art), and it also highlights weaknesses of the new improvements (namely, that their effectiveness for path-shaped query patterns is limited and that they may even have slightly worse performance for such queries).

The manuscript is well written (although it felt a bit dry and mechanical at times) and defines the proposed techniques and approaches in great detail. A running example illustrates the different concepts of the approaches further. I particularly appreciate the figures that visualize the steps of the cardinality estimations. In summary, the definitions, descriptions, and examples equip the reader with everything necessary to reproduce the authors' implementation of the approaches.

The evaluation of the proposal is comprehensive, including experiments based on a benchmark using synthetic datasets and queries as well as a benchmark with real domain datasets. The experimental setup is reasonable and described in sufficient detail. The evaluation provides a good understanding of the behavior of the proposed approach and how it compares to / improves upon the state of the art in terms of scalability (increasing numbers of peers issuing queries), impact of different types of query patterns, and network usage.

These are all positive aspects of the manuscript. The only minor weaknesses are that i) the authors completely omit the question of dataset dynamics (what happens if the dataset changes?), ii) the strategy for dealing with "fragments with infrequent characteristic sets" (end of Sect.4.2) is not entirely clear, and iii) the formal part of the manuscript contains a few minor inconsistencies and flaws (see below).

Therefore, I propose to ask the authors for a minor revision of the manuscript that addresses these weaknesses.

In particular, regarding the dataset dynamics, I would like the authors to include an explicit statement about the fact that the proposed approach focuses on static datasets. Regarding the strategy for dealing with "fragments with infrequent characteristic sets," I would like the authors to extend their description such that it answers the following questions:

* How do you decide which fragments to merge?
* ... and which ones to split?
* ... and how exactly to split them? (i.e., which predicates remain together?)
* When splitting a fragment, are the triples of that fragment duplicated into the other two fragments?

Moreover, the following minor issues in the formalization need to be fixed.

* Sect.3 It is not clear what "conjunctive triple patterns" are.

* Def.2 For the purpose of a complete formal definition, the notion of "subjects" of a BGP should be defined formally.

* Def.3 This definition introduces the notion of solution mapping as a concept that depends on a given knowledge graph (KG), which is not the standard way of defining this notion and not consistent with the KG-*in*dependent way how this notion is used in the rest of the manuscript.

* The definition of "the answer to P over G" in the paragraph after Def.3 guarantees only soundness but not completeness. What is missing is a statement that the set [[P]]_G contains all possible solution mappings that satisfy the given condition.
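A set-builder formulation along the lines this comment asks for, stating both soundness and completeness in a single definition, might look as follows (the symbols are assumptions based on the notation used in this review, not the manuscript's exact notation):

```latex
[\![P]\!]_G \;=\; \{\, \mu \;\mid\; \mathrm{dom}(\mu) = \mathrm{vars}(P) \ \text{and} \ \mu[P] \subseteq G \,\}
```

Because the set is defined by comprehension, every solution mapping satisfying the condition is included (completeness) and only such mappings are included (soundness).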

* The formalization of T[P] is flawed in two ways:
i) First, by writing "T ∈ μ[P]," T is a triple and not a "set of [...] triples."
ii) Second, the definition is based on a given solution mapping or a given set of solution mappings that constitute the answer to P over a KG G. However, this dependency on the solution mapping / on the KG is not reflected in the denotation T[P].

* Def.4 It is not clear why N_n (i.e., the third component of the 3-tuple) is parameterized by n, and I don't think that n is needed here.

* Def.5 The definition does not capture the condition that the fragments returned by such a function are meant to be disjoint.

* Def.5 I assume that the union of the fragments returned by such a function are meant to cover the given KG completely. This condition should be captured by the definition.

* Def.6 and Fig.1 The font used for N in the definition and in the caption of the figure should be made consistent.

* Def.7 The last part of this sentence ("M(t) returns the indexed nodes that have fragments holding data matching the triple t") should be expressed more formally.

* Def.8 What is ν'(t) supposed to return if the given condition ("if there exists a triple in f that matches t") does not hold?

* Def.8 What is "the index slice describing f"?

* Footnote 1 (Page 7): For (f ⊕ g)(x) = f(x), is there the additional condition that g is _not_ defined at x?

* ... and what about cases in which neither f nor g are defined at x?

* In the sentence that defines "a Bloom filter for a set S of IRIs" (second sentence on page 8), it is not clear what n is.

* Additionally, in the same sentence, I suggest to avoid overloading the meaning of the symbol S, which was used before as a symbol to denote a set of slices.

* Def.9 What are "the names of [...] IRIs"?

* ... and what are "the IRIs with prefix p_i"?

* Def.9 and Def.13 Shouldn't the "prefix-mapping function[s]" \theta and \theta_i be required to be injective?

* Def.17 The union symbol in the definition is parameterized with n but later in the paper it is not parameterized anymore and the sentence after the definition emphasizes why not. Therefore, I suggest to remove the parameterization also from the definition itself.

* Fig.14 The caption of the figure mentions watdiv100M twice. I assume that one of these mentions was meant to be watdiv1000M.