Review Comment:
The submission proposes a similarity metric that evaluates the accuracy of the predictions that a query optimizer makes about the cardinalities of triples and joins. As these predictions are the primary input for estimating cost, erroneous cardinality predictions can lead the optimizer to sub-optimal query plans.
The proposed similarity metric is compared against q-error, a metric for evaluating cardinality estimates from the relational database literature. The basis of comparison is how strongly each metric correlates with actual execution times on a benchmark suite. That is to say, the proposed similarity metric is found superior to q-error in the sense that it is a better predictor of overall query execution performance. Such a metric can be used (in combination with metrics evaluating other steps of the query processing pipeline) to pinpoint the step that causes poor performance.
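For reference, and assuming the submission follows the standard definition from the relational database literature, the per-estimate q-error of an estimate e for a true cardinality r is q(e, r) = max(e / r, r / e), typically aggregated over a plan or workload by the maximum or the geometric mean.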
This work can potentially be a significant contribution to the federated query processing literature, but the submitted manuscript needs substantial improvements to materialize this potential.
Substantive comments:
One comment is that the positive correlation between the similarity metric and the query runtimes was only clearly visible after applying regression to treat outliers. The authors also mention noteworthy outliers (see p. 12, left column, lines 5ff) where a significant error in cardinality estimation had no impact on query execution, because the numbers worked out so that the optimizer selected the optimal plan anyway. This instance (and, one assumes, many if not most of the outliers) is what the proposed similarity metric is promising to capture by normalizing the error with the Euclidean length of the cardinalities vector (see the denominator in the definition of the similarity metric).
The authors should investigate these instances and provide an error analysis explaining why this normalization failed to capture how the cardinality estimation errors cancelled each other out and the correct plan was still produced. Such an analysis can yield insights about weights and other parameters in the definition of the metric that would reflect how errors in the same direction across all cardinalities tend to cancel each other out. Even if the authors cannot devise an appropriate definition within the scope of this article, their analysis can guide future work.
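To illustrate the effect with a hypothetical example (the numbers are mine, not taken from the paper): suppose two candidate join orders have true intermediate cardinalities of 10^3 and 10^4, and the estimator over-estimates both by a factor of 100, yielding 10^5 and 10^6. Every individual estimate then has a q-error of 100, yet the relative ordering of the plans' estimated costs is unchanged and the optimizer still selects the cheaper plan; a useful metric should penalize this situation far less than one where the errors flip the ordering.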
Another omission is that the effect of index size is not investigated. Especially when it comes to estimating join cardinalities, the actual cardinalities can vary wildly depending on selectivity and cannot be predicted from individual triple cardinalities alone. Accurate join estimates therefore call for multi-dimensional indexes, but such indexes are voluminous, and maintaining and querying them can become a significant overhead. From this perspective, I suggest that instead of being a single value, the metric should be a plot of the current metric against index size (reducible to a scalar such as the area under the curve, when necessary; see the sketch below).
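As a minimal sketch of the suggested scalar reduction (variable names and values are hypothetical, only to make the suggestion concrete):

    import numpy as np

    # hypothetical measurements: index size in MB and the similarity metric achieved at that size
    index_sizes = np.array([10.0, 50.0, 200.0, 800.0])
    similarity = np.array([0.45, 0.62, 0.74, 0.78])

    # normalize index size to [0, 1] so the resulting AUC is comparable across systems
    x = (index_sizes - index_sizes.min()) / (index_sizes.max() - index_sizes.min())
    auc = np.trapz(similarity, x)  # area under the similarity-vs-index-size curve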
My final comment is that the submission does not evaluate the very first contribution claimed by the authors (p. 2, left column, lines 44ff): we do not see how the new metric can give a fine-grained evaluation of query engines. For this, I recommend that the authors identify in their query suite those queries where erroneous cardinality estimation has led to sub-optimal performance, and compare the similarity metric against metrics evaluating other aspects of the query processing pipeline (p. 2, left column, lines 13-21). This will demonstrate how the similarity metric correctly identifies cardinality estimation error as the reason for the sub-optimal overall performance.
Editorial comments:
The paper is overall clear and well-written, with only a few editorial comments:
p. 2, left column, lines 21ff:
I do not understand the juxtaposition to Acosta et al. [24]. The cited paper does not address measuring the accuracy of the cardinality estimators; it is as relevant as any paper on evaluating query processing and as such I do not understand why it is being singled out. I urge the authors to revise this passage in order to avoid creating the (false) impression that they are referring to prior work on the same topic as what they are handling.
p. 2, left column, line 9:
"principle input": I think you mean principal
Section 4:
In all definitions, (r_1, ..., r_n), (e_1, ..., e_n), and in general all these vectors are in R^n, not R.
Also, it is better to use \cdot rather than * to denote the usual, arithmetic multiplication.
Finally, in Definitions 2 and 3 you are using |v| to denote the Euclidean norm instead of the generally used ||v||_2 notation.
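Purely to illustrate the suggested notation (this is not the paper's actual definition, just an example of the intended form), a cosine-similarity-style expression would then read sim(r, e) = (r \cdot e) / (||r||_2 ||e||_2) for r = (r_1, ..., r_n), e = (e_1, ..., e_n) in R^n.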
Section 5:
It feels a bit odd that this brief introduction to the federation engines used in the experiments is elevated to a full section. I would only see a reason for a dedicated section on the engines if it included an analysis that provided new, foreground knowledge. This could be, for example, an analysis of these engines' inner workings that gives prior theoretical motivation for the proposed q-error and similarity metrics, or an analysis that guides the reader on how to best read and interpret the empirical results comparing cardinality errors and query execution times for these engines. As things stand, the authors seem to have pragmatic rather than theoretical reasons for their selection (see p. 6, left, lines 39-44 and p. 7, left, lines 16-19). These are perfectly valid reasons to select these systems, but I recommend moving the content currently making up Section 5 into the experimental setup in Section 6.1 or into the background in Section 2.
Figures 2 and 3:
The caption should explain how to interpret the gray area in the graphs.
p. 12:
"SPLENDID's average runtime is 98,494.52ms" and in other places: This is an abundance of significant digits, not necessary for the argument. 98sec would suffice for the comparisons you are making, 98.5 sec would already be slightly more than needed; 1/100th of a ms is really overkill and, for that matter, I suspect below the granularity that can be reliably measured.