An Empirical Evaluation of Cost-based Federated SPARQL Query Processing Engines

Tracking #: 2464-3678

Umair Qudus
Muhammad Saleem
Axel-Cyrille Ngonga Ngomo
Young-Koo Lee

Responsible editor: 
Ruben Verborgh

Submission type: 
Full Paper
Finding a good query plan is key to the optimization of query runtime. This holds in particular for cost-based federation engines, which make use of cardinality estimations to achieve this goal. A number of studies compare SPARQL federation engines across different performance metrics, including query runtime, result set completeness and correctness, number of sources selected and number of requests sent. However, although they are informative, these metrics are generic and unable to quantify and evaluate the accuracy of the cardinality estimators of cost-based federation engines. In addition, to thoroughly evaluate cost-based federation engines, the effect of estimated cardinality errors on the overall query runtime performance must be measured. In this paper, we address this challenge by presenting novel evaluation metrics targeted at a fine-grained benchmarking of cost-based federated SPARQL query engines. We evaluate the query planners of five different cost-based federated SPARQL query engines using LargeRDFBench queries. Our results suggest that our metrics are clearly correlated with the overall query runtime performance of the selected federation engines, and can hence provide important insights when developing future generations of federation engines.

Minor Revision

Solicited Reviews:
Review #1
Anonymous submitted on 23/Apr/2020
Major Revision
Review Comment:

I thank the authors for their reply letter and updated paper.
My comments regarding statistical assumptions have been addressed,
as well as the minor comments.
Unfortunately, my other concerns from previous review still hold (see below).

In general, I find the paper to be wrongly positioned.
The abstract and introduction primarily talk about the similarity metrics that are introduced,
and experiments that are performed with it.
However, as the authors themselves state, the correlation to execution times is weak to moderate,
which makes its usefulness questionable.
Furthermore, the paper does not really discuss the practical applicability of the metric (see below).
The paper does however include extensive experiments, which I find valuable.
Many of these experiments (section 6.4-6.7) are not directly related to this similarity metric.
Given the size of these sections, this shows that there is an issue with the positioning of the paper.
These experiments (sections 6.4-6.7) should not be framed as explaining and evaluating the similarity metric.
Rather than removing these experiments from the paper,
I would encourage the authors to reposition the paper
so that it focuses on the evaluation of cost-based federation engines (following the title),
with the similarity metric as one possibility.

I am not satisfied by the authors' reply regarding the issue of robust regression.
The authors ignored my main point regarding the study of the weights that are assigned by the Huber loss function,
and which outliers were modified/dropped by it.
In robust regression, it is important that the modified outliers are analyzed
to make sure that it is really desired that these outliers are modified.
Since the authors rely heavily on the results of this robust regression,
I strongly encourage the authors to include such an analysis of the robust regression process and its results.
In this case, I would expect an analysis of which types of queries in which engines are modified by the Huber loss function.

In my last review, I mentioned that the conclusion section is rather weak.
The authors unfortunately made only minor changes here, so this concern still holds.
The authors replied to my answers in their reply letter,
but they did not include this information into the conclusion section.
Furthermore, their reply is mostly unsatisfying.
* Reviewer comment: Cardinality estimates are clearly not the only predictor, what else is? Can it be measured?
Author's response: Table [1] shows all existing metrics which are used in performance evaluation of the federated SPARQL query processing engines.
New reviewer response: Table 1 mostly contains metrics that can be used to evaluate performance after query execution, but these are not "predictors" that can be used during query execution.
* Reviewer comment: How will this metric be used?
Author's response: Used to build better cost-based federated SPARQL query engines
New reviewer response: This answer is very vague. Please elaborate.
* Reviewer comment: What impact will this work have?
Author's response: Provide new measures for the development of new and better cost-based federated SPARQL query engines
New reviewer response: Please include this in the conclusion section and elaborate.
* Reviewer comment: It feels to me like the new metrics are to be used as alternative to execution times.
Author's response: This is not the case. As mentioned before, execution time has been regarded as the central performance metric. The aim of the majority of federated query engines is to optimize query runtimes. This metric is to be used as a proxy for runtime within cost-based federation engines to improve their planners.
New reviewer response: Why is such a proxy metric needed? Similarity errors seem much harder to calculate than total execution time, as it requires capturing more information during query execution within an engine. It is not clear to me why query engine developers would prefer the similarity metric over plain query execution times.
* Reviewer comment: Are there cases where execution times can not be used, but these new metrics can be used?
Author's response: Yes, during query planning.
New reviewer response: This sounds incorrect to me. The similarity metric requires knowledge of the real cardinalities, which are unknown during query planning (otherwise the optimal plan would be easy to determine). Based on my understanding of the paper, it is not clear to me how this metric can be used during query planning.

Minor comments:
* Some places in the paper (such as abstract) still overstate the correlation between similarity metric and query exec time. This should be updated.
* Page 1: line 30: "across" should be removed
* Quite a few typos in section 5:
* Page 5: line 47: "The CostFed query planner"
* Page 6: line 5: "dataset"
* Page 6: line 27: "SPLENIDID"
* ...
* Odyssey is marked in bold in figure 2, is there a reason for this?
* Conclusion: "The proposed cardinality estimating metrics are generic and can be applied to non-federated cardinality-based query processing engines as well."
This paper shows no proof that these metrics can be applied to non-federated query processing engines as well, so this is an unfounded claim.

Review #2
Anonymous submitted on 14/May/2020
Review Comment:

The paper addresses the problem of measuring the accuracy of the cardinality estimators of federated SPARQL engines. The paper proposes similarity error metrics, inspired by the q-error metric introduced in databases. The paper measures the correlation of the value of the novel metrics with the overall query runtimes. An empirical evaluation is carried on the state of the art cost-based SPARQL federated query engines. Experimental results show that the proposed similarity-based errors have a positive correlation with the query runtime.

The current version is a major revision of the paper.
I thank the authors for the deep improvement of the quality of the paper. I am satisfied with the new version of the paper, it answers all my questions and integrates all my suggestions.

I have minor comments:

Page 5, line 45, Costfed -> CostFed
Page 6, line 13, In the definition: avgOS(p; D) is the average subject selectivity of p in D. subject must be replaced by object.
Page 6, line 48, add . after respectively
Page 8, line 8 add space after estimated
Page 15, line 47 remove space after )
Page 17, lines 45- 51: formats problems
Page 18, lines 18-21: What do you mean by: “Another important factor that is worth mentioning here is that the number of transferred tuples does not consider the number of columns in the result set”

Review #3
By Stasinos Konstantopoulos submitted on 24/May/2020
Minor Revision
Review Comment:

The submission proposes a similarity metric that evaluates the accuracy of the predictions that a query optimizer makes about the cardinalities of triples and joins. As these predictions are the primary input for estimating cost, erroneous cardinality predictions can lead the optimizer to sub-optimal query plans.

The proposed similarity metric is compared against q-error, a metric for evaluating cardinality predictions from the relational database literature. The basis of comparison is how strongly the evaluation metric correlates with actual execution times on a benchmark suite. That is to say, the proposed similarity metric is found superior to q-error in the sense that the new metric is a better predictor of overall query execution performance than q-error. Such a metric can be used (in combination with other metrics evaluating other steps of the query processing pipeline) in order to pin-point the step that causes ill-performance.
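For readers unfamiliar with the baseline, the q-error from the relational-database literature is the multiplicative factor by which a cardinality estimate deviates from the true value. The sketch below illustrates q-error only; the paper's own similarity metric is defined in its Section 4 and is not reproduced here, and the function names are illustrative rather than any engine's actual API.

```python
def q_error(estimate: float, actual: float) -> float:
    """Multiplicative error factor max(e/r, r/e); >= 1, where 1 is a perfect estimate."""
    if estimate <= 0 or actual <= 0:
        raise ValueError("q-error is defined for positive cardinalities only")
    return max(estimate / actual, actual / estimate)


def plan_q_error(estimates, actuals):
    """Plan-level summary: the worst q-error over all triple patterns/joins,
    since a single bad estimate can derail the whole plan."""
    return max(q_error(e, r) for e, r in zip(estimates, actuals))
```

For example, estimating 200 results when 50 are returned gives a q-error of 4.0, the same as estimating 50 when 200 are returned; the symmetry is what distinguishes q-error from a plain relative error.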

This work can potentially be a significant contribution to the federated query processing literature, and the submitted manuscript is a substantial contribution towards materializing this potential.

Regarding the authors' reaction to my comment about taking index size into account, I do not agree with their rationale for not incorporating this information into the metric (explanation follows), but it is their metric and their article, so AFAIIC the substance of the comment has been addressed, as they now report index size and index construction time.

For the sake of academic discussion only, not to be understood as a review comment, let me explain my argument by reduction to absurdity: there is nothing preventing a federation engine from copying the complete datasets and calling this copy the "index", internally trying out all sub-queries on all datasets, and then proceeding to plan the "real" execution based on perfect cardinality "estimations". None of the engines tested does this, but in my opinion the metric is weakened by the fact that it can be gamed in such a straightforward way. In fact, although the default behaviour is the one correctly described in Section 5, the Semagrow index can be configured to also include multi-dimensional join cardinalities. The index extractor is given as a parameter how much space it is allowed to consume, and it decides how to use this space budget prudently, keeping only the most surprising selectivities. Naturally it is not advisable to use too large an index, as searching through the index would end up taking longer than executing even the most suboptimal plan, but it would game the cardinality estimation metric. So a metric that combines cardinality accuracy with index size would be a useful tool for tuning systems with a variable-size index.

Regarding the authors' comment that index size depends on the format used to store the index, I am sure that "size" can be defined in a way that abstracts to information granules without being sensitive to the specific format.

My other comment has to do with the level of detail in the analysis of the experiments' outcomes. Although the second version of the article has become considerably more technically detailed and interesting, I think that some improvement is still in order: the empirical results do not refer back to the excellent overview the authors now give in Section 5 to explain which characteristics of each of the five cardinality estimation strategies do or do not work well for the different dataset/query situations explored by the benchmark. In other words, the new metric should be compared against the q-metric not only statistically, but also by using the hints given by the statistics to go back and demonstrate which actual phenomena the q-metric fails to capture but the new metric does. This would be a blueprint for how to use the metric to discover and fix problems in federation engines, demonstrating its potential and increasing our confidence that it captures something that the q-metric does not.

Given the above, I find the authors' response to the relevant comment in the first review satisfactory, as they added a very interesting analysis of how better cardinality estimation leads to better query plans. What is missing, editorially rather than substantially, is a concrete demonstration of how the new metric can be used in a way that q-metric alone would be insufficient. This would evaluate the first contribution claimed by the authors: the new metric can give a fine-grained evaluation of query engines.

Another minor editorial in Section 4.1, Definition 1:
(r_1, ... r_n) and (e_1, ... e_n) are in R^n, not R
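A possible corrected phrasing of the definition's domain in LaTeX, assuming (as the comment suggests) that the two vectors collect the real and estimated cardinalities of the n operators:

```latex
(r_1, \dots, r_n) \in \mathbb{R}^n, \qquad (e_1, \dots, e_n) \in \mathbb{R}^n
```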