Review Comment:
This manuscript presents a next step that improves upon the authors' earlier work on partitioning and querying RDF datasets using peer-to-peer approaches. The contributions of this manuscript are i) a new technique for partitioning a dataset by considering Characteristic Sets, ii) a new indexing scheme that uses separate Bloom filters for subjects and for each predicate (indexing the objects associated with each predicate), and iii) a query optimization strategy that leverages the new indexing scheme. A comprehensive evaluation shows that these improvements outperform the baseline consisting of the authors' earlier systems (which have been the state of the art), and it also highlights weaknesses of the new improvements (namely, that their effectiveness for path-shaped query patterns is limited and that they may even have slightly worse performance for such queries).
The manuscript is well written (although it felt a bit dry and mechanical at times) and defines the proposed techniques and approaches in great detail. A running example illustrates the different concepts of the approaches further. I particularly appreciate the figures that visualize the steps of the cardinality estimations. In summary, the definitions, descriptions, and examples equip the reader with everything necessary to reproduce the authors' implementation of the approaches.
The evaluation of the proposal is comprehensive, including experiments based on a benchmark using synthetic datasets and queries as well as a benchmark with real domain datasets. The experimental setup is reasonable and described in sufficient detail. The evaluation provides a good understanding of the behavior of the proposed approach and how it compares to / improves upon the state of the art in terms of scalability (increasing numbers of peers issuing queries), impact of different types of query patterns, and network usage.
These are all positive aspects of the manuscript. The only minor weaknesses are that i) the authors completely omit the question of dataset dynamics (what happens if the dataset changes?), ii) the strategy for dealing with "fragments with infrequent characteristic sets" (end of Sect.4.2) is not entirely clear, and iii) the formal part of the manuscript contains a few minor inconsistencies and flaws (see below).
Therefore, I propose to ask the authors for a minor revision of the manuscript that addresses these weaknesses.
In particular, regarding the dataset dynamics, I would like the authors to include an explicit statement about the fact that the proposed approach focuses on static datasets. Regarding the strategy for dealing with "fragments with infrequent characteristic sets," I would like the authors to extend their description such that it answers the following questions:
* How do you decide which fragments to merge?
* ... and which ones to split?
* ... and how exactly to split them? (i.e., which predicates remain together?)
* When splitting a fragment, are the triples of that fragment duplicated into the other two fragments?
Moreover, the following minor issues in the formalization need to be fixed.
* Sect.3 It is not clear what "conjunctive triple patterns" are.
* Def.2 For the purpose of a complete formal definition, the notion of "subjects" of a BGP should be defined formal.
* Def.3 This definition introduces the notion of solution mapping as a concept that depends on a given knowledge graph (KG), which is not the standard way of defining this notion and not consistent with the KG-*in*dependent way how this notion is used in the rest of the manuscript.
* The definition of "the answer to P over G" in the paragraph after Def.3 guarantees only soundness but not completeness. What is missing is a statement that the set [[P]]_G contains all possible solution mappings that satisfies the given condition.
* The formalization of T[P] is flawed in two ways:
i) First, by writing "T ∈ μ[P]," T is a triple and not a "set of [...] triples."
ii) Second, the definition is based on a given solution mapping or a given set of solution mappings that constitute the answer to P over a KG G. However, this dependency on the solution mapping / on the KG is not reflected in the denotation T[P].
* Def.4 It is not clear why N_n (i.e., the third component of the 3-tuple) is parameterized by n, and I don't think that n is needed here.
* Def.5 The definition does not capture the condition that the fragments returned by such a function are meant to be disjoint.
* Def.5 I assume that the union of the fragments returned by such a function are meant to cover the given KG completely. This condition should be captured by the definition.
* Def.6 and Fig.1 The font used for N in the definition and in the caption of the figure should be made consistent.
* Def.7 The last part of this sentence ("M(t) returns the indexed nodes that have fragments holding data matching the triple t") should be expressed more formally.
* Def.8 What is ν'(t) supposed to return if the given condition ("if there exists a triple in f that matches t") does not hold?
* Def.8 What is "the index slice describing f"?
* Footnote 1 (Page 7): For (f ⊕ g)(x) = f(x), is there the additional condition that g is _not_ defined at x?
* ... and what about cases in which neither f nor g are defined at x?
* In the sentence that defines "a Bloom filter for a set S of IRIs" (second sentence on page 8), it is not clear what n is.
* Additionally, in the same sentence, I suggest to avoid overloading the meaning of the symbol S, which was used before as a symbol to denote a set of slices.
* Def.9 What are "the names of [...] IRIs"?
* ... and what are "the IRIs with prefix p_i"?
* Def.9 and Def.13 Shouldn't the "prefix-mapping function[s]" \theta and \theta_i be required to be injective?
* Def.17 The union symbol in the definition is parameterized with n but later in the paper it is not parameterized anymore and the sentence after the definition emphasizes why not. Therefore, I suggest to remove the parameterization also from the definition itself.
* Fig.14 The caption of the figure mentions watdiv100M twice. I assume that one of these mentions was meant to be watdiv1000M.
|