Online approximative SPARQL query processing for COUNT-DISTINCT queries with Web Preemption

Tracking #: 2842-4056

Authors: 
Julien Aimonier-davat
Hala Skaf-Molli
Pascal Molli
Arnaud Grall
Thomas Minier1

Responsible editor: 
Guest Editors ESWC 2020

Submission type: 
Full Paper
Abstract: 
Getting complete results when processing aggregate queries on public SPARQL endpoints is a challenge, mainly due to the application of quotas. Although Web preemption allows processing aggregate queries online, on preemptable SPARQL servers, data transfer is still very large when processing count-distinct aggregate queries. In this paper, it is shown that count-distinct aggregate queries can be approximated with low data transfer by extending the partial aggregation operator with Hyper-LogLog++ sketches. Experimental results demonstrate that the proposed approach outperforms existing approaches by orders of magnitude in terms of the amount of data transferred.
Full PDF Version: 
Tags: 
Reviewed

Decision/Status: 
Accept

Solicited Reviews:
Click to Expand/Collapse
Review #1
By Oscar Corcho submitted on 20/Aug/2021
Suggestion:
Accept
Review Comment:

I am very happy with the way in which the large set of comments that I provided in the review of the original submission have been dealt with by the authors. I think that the paper now reads better and especially that it is more accurate in all the claims that are being made (especially in terms of setting explicitly the limitations of the work that has been done and being more clear about the advances done and next steps).

The new experiments added also improve the support for the main claims and conclusions.

Therefore, I consider this paper ready for publication.

Review #2
By Aidan Hogan submitted on 21/Aug/2021
Suggestion:
Accept
Review Comment:

## Response to letter

Overall, I am generally satisfied with the authors' response to my comments. As some remaining observations:

** I thank the authors for clarifying what they mean by "full-mappings" operators (which is based on full-relations operators from the book by Garcia-Molina et al.). In general, I still see this as a practical/informal distinction that relates more to "physical operators" (how the operators are implemented) than "logical operators" (how they are defined). I think there are ways to implement "full-mappings" operators in a "tuple-at-a-time" ((index) nested-loop) fashion, which grays the line between both types of operators. As some minor related comments:

- "For example, the ORDER BY is a full-mappings operator as it needs to materialize all the mappings before sorting them." I think this needs to be revised as the mappings do not always *need* to be materalised or sorted (as discussed a little later).

- "However, in the general case, queries that use a full-mappings operator are not preemptable." Again, while I understand the intuition, this is a very specific claim that I don't see evidence for. For example, if we have the sorted indexes available, I do not see how preempting ORDER BY (in the general case) is any more difficult than preempting joins. Note, for example, that enumerating mappings in order is precisely how many worst-case-optimal join algorithms work. Unless you have a proof for this, I would suggest to simply weaken the claim to suggest, e.g., that in such cases, preemption is "not straightforward", or that it has not yet been explored in detail, or something similar, rather than suggesting it cannot be done.

- "hence it cannot be suspended and resumed in constant time." Again, this claim is strong, not proven, and should either be weakened, or a proof provided.

** I thank the authors for clarifying the combination of HLL and LinearCounting in HLL++ to deal with both small and large cardinalities, and also for adding error rates.

** I think that other comments have been addressed appropriately.

## Minor comments

- "allows processing" -> "supports processing"
- "Public endpoints such as DBPedia or Wikidata [support]"
- "registers [de]noted"
- "two third[s] of the queries"
- "estimator to estimate the result" -> "estimator for the result"
- "that [supports] estimating a distinct sum"

## Recommendation

I think this is an interesting paper, and the authors have clarified/addressed a number of key concerns regarding error rates, limitations, scalability, and readability. I think that the remaining fixes (including weakening some informal claims) relate to language, and thus the paper can be accepted.

Review #3
By Stasinos Konstantopoulos submitted on 09/Sep/2021
Suggestion:
Accept
Review Comment:

The submission presents a method and prototype for computing approximations of count-distinct aggregate queries against servers that apply Web preemption to ensure responsiveness and fairness between their clients. Web preemption is likely to make it impractical to get count-distinct results against larger datasets, as (a) such queries are likely to not be able to be computed within a timer slot; and (b) receiving all results to perform the aggregation on the client side wastes network resources when only aggregate results are needed.

The submission is well written. The text presents ideas and results clearly and at the right level of technical detail.

The submission substantially extends the ESWC 2020 publication that presented a method for preemtable agregation, excluding the DISTINCT operator. As handling grouping by distinct keys is a substantially memory-intensive operation, there is significant added value (also reflected in the text) with respect to the ESWC 2020 publication.

The proposed extension uses HyperLogLog, a prior algorithm, to approximate counts with a limited memory budget. This allows Web preemption to maintain state between slots allocated to the same query with minimal overhead.

The idea was experimentally tested on BSBM and DBPedia. The method delivers as expected, drastically reducing the tranferred data.

My comments from the first review round have been addressed.