Review Comment:
## Summary
The paper discusses methods by which aggregation queries can be made (efficiently) preemptible, meaning that the query can be paused and resumed efficiently, allowing it to be processed in discrete quanta (e.g., each initiated by an individual client request). Though previous work has tackled the preemption of triple patterns, basic graph patterns, etc., the case of aggregation is considerably more challenging. The authors first propose a preemption framework that is similar to how distributed systems (such as Hadoop or Spark) process aggregations, decomposing the aggregation into a combiner function and a reducer function. This allows the base solution mappings to be processed in chunks on the server using the combiner, with the final result being computed by the client using the reducer. However, in the case of COUNT DISTINCT or AVG DISTINCT, a meaningful combiner is not possible, and thus, by default, the server would have to stream all values to the client, which would then have to apply the DISTINCT operation followed by the COUNT/AVG. Specifically for the COUNT DISTINCT case, the authors propose an approximate method based on HyperLogLog (HLL), which uses the maximum number of leading zeros across different buckets of hashed values in order to estimate the overall cardinality. This ensures that cardinalities can be estimated in constant space with a given error guarantee.
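To make the combiner/reducer split concrete, here is a minimal sketch of the idea as I understand it (my own illustration in Python; the names and structure are not taken from the paper): algebraic aggregates such as COUNT, SUM and AVG admit a constant-size partial state that the server can update per quantum and the client can merge, whereas COUNT DISTINCT would need the set of all values seen so far as its partial state.

```python
# Hypothetical illustration of the combiner/reducer decomposition for AVG;
# names and structure are my own, not the authors' code.

def combine_avg(partial, value):
    """Server side: fold one value into a constant-size partial state."""
    total, count = partial
    return (total + value, count + 1)

def reduce_avg(partials):
    """Client side: merge per-quantum partial states and finalize the AVG."""
    total = sum(t for t, _ in partials)
    count = sum(c for _, c in partials)
    return total / count if count else None

# Two quanta worth of partial states for the values 4, 6 and 2:
q1 = combine_avg(combine_avg((0, 0), 4), 6)   # -> (10, 2)
q2 = combine_avg((0, 0), 2)                   # -> (2, 1)
assert reduce_avg([q1, q2]) == 4.0

# For COUNT(DISTINCT ?x) no such constant-size state exists: the partial
# state is the set of values seen so far, hence the HLL-based approximation.
```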
The proposed methods are then used to extend an existing preemption-enabled query engine, called SaGe, and experiments are run evaluating aggregation queries over a variety of datasets. The systems compared are SaGe, Virtuoso, TPF, and SaGe extended with both forms of preemptible aggregation (exact and approximate). The results show that although Virtuoso provides the fastest runtimes and lowest data transfer overall, it would reach the timeout limits typically defined on public endpoints for many queries. The next best alternatives are the extensions of SaGe with the proposed aggregation frameworks, which, though slower than Virtuoso, support pausing and resuming queries efficiently. The approximation framework based on HLL does not improve runtimes, but can reduce the amount of traffic when evaluating COUNT DISTINCT queries.
## Strengths
S1: SPARQL endpoints have become a key part of the Semantic Web infrastructure (e.g., Wikidata receives millions of SPARQL queries per day), and their performance issues are well-known. Query preemption is a promising strategy to improve their reliability by allowing a query to be broken down into quanta of processing. Likewise, aggregation queries are very common, and are often among the most challenging queries for SPARQL endpoints to process (as they often require massive intermediate results to produce a concise result). Hence, I think that preemption for aggregate queries is an interesting and natural topic to explore.
S2: The techniques proposed are promising, both in terms of exact aggregation and approximate aggregation.
S3: The technical content is presented both formally and with examples, which makes it relatively easy to follow. (With the caveat that some of the definitions need some polishing, as discussed later.)
S4: Experiments consider a reasonable selection of alternatives that cover examples from each of the most prominent families of approaches. A range of datasets are considered.
## Weaknesses
W1: There are some claims in the paper regarding certain query operators being "full mappings" operators. From my understanding, this refers to the idea that all of the mappings must be computed before the operator can be applied, making such operators not directly preemptible. The authors include ORDER BY as an example, along with aggregation functions. But it is not very clear to me what precisely this concept of a "full mappings operator" means, or what its relation to preemptibility is (intuitively I understand the rough idea, but formally I am not convinced about the details). The paper further does not justify why these operators necessarily require the full mappings to be computed before they can be applied. Is this just in some cases, or in all cases? For example, ORDER BY can sometimes be evaluated by "streaming" solutions from an index with the appropriate order (in fact, worst-case optimal join algorithms rely on being able to stream the results of a BGP in sorted order on any variable that appears in every triple pattern of the BGP, in time proportional to the number of results). Likewise, in the case of aggregation, it seems that one can apply preemption in a similar manner to joins, processing the aggregation key by key. This is, in fact, precisely how distributed frameworks, and the framework proposed by this paper, work. Put another way, for a given value of K, LIMIT K queries over ORDER BY (even on multiple columns) and aggregation queries (assuming non-trivial groupings) can be computed without computing the full base multiset of solution mappings, assuming we have complete indexes. For ORDER BY, we can generate an ordered list of K candidates for the first variable in the order expression, and then compute the rest of the query in a nested-loop fashion, stopping when we produce K solutions. For aggregate queries, we can again generate a list of candidates for K group keys, and then compute the rest of the query in a nested-loop way, generating the full results for each key, which we can then aggregate (see the sketch below). (For aggregate queries without GROUP BY, it may be necessary to compute all the solution mappings, I am not sure.) In any case, I think these claims surrounding "full mapping operators" need to be clarified, as they form a core part of the motivation for the paper.
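To make the key-by-key point concrete, here is a rough sketch of the nested-loop strategy I have in mind (my own illustration in Python; the toy index and all names are hypothetical, not anything from the paper): given an index that can enumerate group keys and the solutions belonging to each key, groups can be aggregated one key at a time, with natural preemption points between keys and no materialization of the full multiset.

```python
# Minimal sketch of key-by-key aggregation with preemption between groups.
# 'index' stands in for an index mapping group keys to their solutions;
# 'budget' crudely models a time quantum as a number of keys per call.

def aggregate_by_key(index, combine, init, budget, resume_after=None):
    state, processed, last = {}, 0, resume_after
    for key in sorted(index):                    # deterministic key order
        if resume_after is not None and key <= resume_after:
            continue                             # handled in a previous quantum
        if processed == budget:                  # quantum exhausted: pause here
            return state, last                   # partial result + last key done
        acc = init
        for value in index[key]:                 # nested loop over one group only
            acc = combine(acc, value)
        state[key] = acc
        processed, last = processed + 1, key
    return state, None                           # finished, nothing left to resume

# Toy COUNT per group key, computed over two "quanta":
index = {"a": [1, 1, 1], "b": [1], "c": [1, 1]}
first, resume = aggregate_by_key(index, lambda acc, _: acc + 1, 0, budget=2)
second, done = aggregate_by_key(index, lambda acc, _: acc + 1, 0, budget=2,
                                resume_after=resume)
assert first == {"a": 3, "b": 1} and second == {"c": 2} and done is None
```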
W2: Though not previously familiar with the HLL approach, I wonder if it is really suitable for the proposed use-case. It would seem that HLL is better suited to estimating the cardinality of large sets, for which its size is quite small and the estimation based on leading zeros can give relatively low errors; but in SPARQL, when grouping by keys, the sets (with DISTINCT) can often be small. In the case of having many small sets (when counting per group), HLL would seem (intuitively) to give coarse results and to use quite a lot of space (1.5 kB per group key), which might even be larger than storing a small exact set (see the back-of-the-envelope comparison below). Unfortunately, the paper does not provide any evidence that HLL is a good approach for the group-by aggregate case. The experimental results do not measure the actual error of the counts generated, and no other similar alternative is considered.
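As a back-of-the-envelope comparison (my own assumptions beyond the ~1.5 kB per-group figure quoted above; the average term size is a guess for illustration only):

```python
# Rough space comparison: one HLL sketch per group key vs. storing the exact
# set of distinct values, under assumed sizes (not figures from the paper).

HLL_BYTES_PER_GROUP = 1536   # ~1.5 kB per group key, as quoted in the paper
AVG_TERM_BYTES = 60          # assumed average size of a serialized RDF term

def exact_set_bytes(num_distinct):
    """Naive size of shipping/storing the distinct values themselves."""
    return num_distinct * AVG_TERM_BYTES

for n in (5, 10, 25, 100):
    cheaper = "exact set" if exact_set_bytes(n) < HLL_BYTES_PER_GROUP else "HLL"
    print(f"{n:3d} distinct values: exact ~ {exact_set_bytes(n):5d} B, "
          f"HLL ~ {HLL_BYTES_PER_GROUP} B -> {cheaper} is smaller")

# Under these assumptions the exact set wins whenever a group has fewer than
# roughly 1536 / 60 ~ 25 distinct values, which seems plausible for many
# fine-grained GROUP BY keys.
```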
W3: The paper could be much clearer about the limitations of the proposed approach. Some limitations are explicitly discussed, some are alluded to, and some (such as those mentioned in W2) seem not to be discussed. I would like to see a more cohesive discussion (in one section or sub-section) of all of the limitations, before the experimental results are presented. These limitations might include: the space used by HLL, the error rate of HLL for smaller-cardinality sets, the need to still stream all values for AVG DISTINCT, having to count over many group keys, the lack of join and filter optimisations in the current implementation, the issues of scalability, etc. For each limitation, it would be good to clarify why it occurs (e.g., is it a limitation of the expressiveness of SPARQL, a trade-off associated with the proposed approach, or just an implementation issue), and how it might be addressed.
W4: There are a couple of key aspects of the experimental results that I found unclear:
- With respect to scalability, 100 million triples is not that much, and the experiments report results over a *fragment* of DBpedia v3.5.1. Why was a fragment chosen? Would it not work on the full datasets? Why not?
- Perhaps I am mistaken, but wouldn't the data transfer be guaranteed to be significantly larger for SAGE-APPROX than for Virtuoso in SP, since the former must send the HLL registers (1.5 kB per group key?), which are larger by a constant factor than returning just a simple count?
- As mentioned before, I would have liked to see a discussion of the actual error observed when using HLLs. I suspect this might be larger than what the parameter suggests it should be (my guess is that the error guarantee is asymptotic, i.e., defined as the cardinality grows, and does not necessarily hold for small sets); see the rough calculation after this list.
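For reference, the classic HyperLogLog bound is a relative standard error of roughly $1.04/\sqrt{m}$ with $m = 2^p$ registers; it is an asymptotic figure, so the observed error on small groups can differ. A quick calculation (my own illustration, not numbers taken from the paper):

```python
# Relative standard error of a standard HyperLogLog sketch as a function of
# the precision p (m = 2^p registers), using the classic 1.04/sqrt(m) bound.
import math

for p in (8, 10, 12, 14):
    m = 2 ** p
    print(f"p={p:2d}  m={m:5d}  relative std. error ~ {1.04 / math.sqrt(m):.2%}")

# p=10 gives ~3.3% and p=14 gives ~0.8%; an error of ~2% corresponds to
# p of about 11-12. Note that 2^11 = 2048 registers of 6 bits is ~1.5 kB,
# which would be consistent with the per-group size mentioned above, but
# this is my own inference, not something stated in the paper.
```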
W5: While the paper is quite readable, there are frequent language issues and bugs in the notation. I will list some of these below, but the paper would benefit from a more thorough proof-read.
## Recommendation
Overall, I think the work addresses an important problem and proposes interesting techniques. However, I would like to see the paper revised in order to address the aforementioned weaknesses, specifically:
- W1: to clarify the claims surrounding "full mapping operators" (or perhaps remove them entirely if not essential to the paper);
- W2: to present the observed errors associated with HLL in the experimental results (optionally adding a similar baseline);
- W3: to gather together and discuss all of the limitations of the work before the experimental section;
- W4: to clarify why larger datasets were not used in the experiments, and why the data transfer with HLL is often competitive with that of Virtuoso;
- W5: to more carefully proof-read the paper (see minor comments below).
As these are quite significant changes that might affect my overall impressions of the work (considering W2 and W4 in particular), I recommend a Major Revision.
## Minor comments
General:
- "In a previous work [6], it has been demonstrated" Since the submission is not double-blind, I think it would be more informative and natural to point out that this work is yours: "In our previous work [6], we demonstrated".
- "on [the] client side" (various examples)
- "preemptable" would rather be "preemptible"? The latter seems to be a more common spelling. (On checking, I guess that either is fine.)
- Make sure that there is a space before \cite{...} in every case.
- The colours in some of the figures are slightly too dark to comfortably read the black text in a small font against them, particularly the blue.
Abstract:
- "to quota[] enforcement"
- "Although [] Web preemption"
Introduction:
- "requires [materializing] the [query] mappings on [the] client side"
- "[Even] if the processing"
- "Compared to related Approximated Query Approaches [17], this approach ensures to find all group keys in a single pass with a pre-defined error rate for all values." I did not find this difference or benefit clear (at this point). I think this is a key sentence, so I would suggest to clarify, with an example if necessary.
- "The remainder of this paper ..." Section 6 is missing.
Related Works:
- "a lot of CPU, memory, [and] temporary storage [resources]" Also it seems redundant to include temporary storage and memory as the latter includes the former?
- "user[] community"
- "endpoints set[ ]up quotas" (when a phrasal verb it is two words)
- "Query results are then recombine[d] on [the] client-side"
- "This requires transfer[ing] ... and then [computing]"
- "computing the exact results for aggregate queries and approximate values for count-distinct queries" Are count-distinct queries not aggregate queries? (The phrasing is a little confusing.)
- "for count-distinct queries, ensure that all group keys are collected with guaranteed accuracy on values" What does guaranteed accuracy mean here, more precisely?
- "Amazon[, etc.,]"
- "that require counting"
Preliminaries:
- "let [us] consider"
- "RDF terms such [that]"
- "An RDF graph G (called also RDF dataset)" It is true that some papers call RDF graphs RDF datasets, but the RDF 1.1 standard defines RDF datasets in terms of named graphs, not RDF graphs, so I would drop the "(called also RDF dataset)" part.
- "Let [us] assume"
- "Mappings \mu_1 and \mu_2 are compatible on the variable ?x ..." This definition is a bit atypical. In this case, why not simply write $\mu_1(?x) = \mu_2(?x)$? Compatibility is not typically defined with respect to a variable, but rather between two mappings: \mu_1 ~ \mu_2. Also (if I am not mistaken) the definition is never used, so you could just drop it?
- "multiset[] of solution[] mappings"
- "with lists of expressions E = [E_1" (missing subscript)
- "E is restricted to variables, without reducing the expressive power of aggregates" I presume that this is only with the assignment of variables, which is not included in the defined fragment of SPARQL.
- Definition of group: "G an RDF graph", but the RDF graph is not used in the construct.
- "Let[ting] $\Omega = ...$"
- Definition of aggregate: "G an RDF graph", but the RDF graph is not used in the construct.
- Definition of aggregate: "a multiset of partial functions" Before it was defined as a set (and it should be a set as the keys are unique?).
- "that finally sort[s] them."
- "[Although] delegating ... allows [us to] effectively [] process"
- "In this example, Q_1' requires six quanta to complete" Why six quanta? Is this assumed for the purposes of the example or do we have some way to deduce that six quanta are needed?
Computing Partial Aggregations with Web Preemption:
- "the way [in which] the property"
Count-Distinct SPARQL Aggregate Queries:
- "over several quanta, as [per]"
- "requires [] transfer[ing]"
- "and [] bounded memory"
- "typical accuracy of 2%" Typical error? Otherwise it is not clear what accuracy means here.
- "This problem [is known] as the many-distinct count problem"
- "Thanks to [] web preemption"
- "In the following ... partial aggregations." Poorly structured; I would suggest to rephrase.
- In the definition of the HLL set, p is used to denote the precision (which I would understand to be a real value), while later it is used as an integer ($m = 2^p$). I do not (at this point) understand what a precision of (e.g.) 10 means here. The notion of precision needs to be defined more clearly at this point, rather than where it is currently defined. Also, it would be good to provide the formula for computing the error rate given p. From the example given, it would appear to be something like $(1 + p/100) \times \sqrt{2^p}$, which seems a bit strange.
- "an array R of m = 2^p registers" It would be good to indicate what kind of value (how many bits) each register contains.
- "represents the ind[ex]"
- "64 bit[] hash value"
- "[Long] quant[a] [reduce] data transfer as HLL-sets are better used."
- "the client memory is no [longer] a shared resource"
Experimental Study:
- "What is the impact of [quantum duration] on the data transfer"
- "count distinct queries [reduce] data transfer?"
- "[]Python SAGE-AGG and SAGE-APPROX clients ... with [] Algorithm 2."
- "are [run] on synthetic and real-world datasets"
- "quer[y] workloads"
- "Virtuoso without quota [performs the best] in terms of"
- "many [HTTP] requests"
- "quer[y] execution time"
- "client-side [like] TPF"
- "red dotted line in [Figure] 8"
- "after [a] time quantum"
- "for each group key per quantum is bound[ed]"
Conclusion and Future Works:
- "elements from [the] server to [the] clients"
- "in the experiment[s]"
- "could [reduce] the data transfer"
References:
- Be sure to enforce capitalisation for acronyms (e.g., SPARQL), proper nouns (e.g., Virtuoso), etc., in the paper titles.