Review Comment:
Summary of the paper:
The authors tackle problem of scaling up to large federations of RDF datasets during SPARQL query processing, and propose x-Avalanche a SPARQL query engine for Web-accessible RDF data. x-Avalanche extends Avalanche, a previous work by the same authors. x-Avalanche is empowered with the capability of traversing search spaces of plans that include disjunctions of sources and produce query plans that minimize time to generate the first and total results. Plans correspond to bi-partite graphs and are represented as plan matrices. A dynamic programming based algorithm is proposed to solve the problem of finding execution plans that correspond to reductions of the plan matrices. A cost model is defined to guide the optimizer into the space of low estimated cost plans. A union operator is defined to merge the data retrieved from the selected sources in a distributed fashion; the proposed union operator relies on a master-slave federation of endpoints, where master endpoints are able to union the data produced by the slaves.
Furthermore, SPARQL endpoints are wrapped by proxies, which are able to join and cache intermediate results. Performance and scalability of the approach is empirically evaluated against data sources generated using the LUBM data generator. Queries are generated using the Waterloo SPARQL Diversity Test Suite (WatDiv). Experimental results suggest that the proposed query optimization techniques are able to identify plans that improve the performance of a federated query engine.
Overall the paper is well-written and tackles a relevant problem in the area of data management. Nevertheless, because of the lack of formalization, the reported results cannot be reproduced or generalized. The main drawback of this work relies on the fact that the authors propose computational techniques and claim optimality of the solution of an optimization problem that is not defined. Similarly, basic concepts are ill-defined. For example, definition 3.1 states that a plan fragment is a query plan where only some sites are considered, but neither a query plan nor the site is defined. Furthermore, nothing is said about the properties of a query plan, e.g., if it is a digraph or tree, or how “sites” are represented in a plan. The definition of fragmented bushy plan suffers of similar problems, and the concepts of plan matrix and plan matrix reduction are not presented. This situation impedes the formal demonstration of the properties of the proposed algorithms, e.g., complexity, conditions of optimality, and completeness and correctness of the answers.
The experimental evaluation is also misspecified, and the experimental design is not statistically justified, e.g., the selection of the parameters that impact on the behavior of the approach are not statically evaluated. Finally, the baseline federated query engine does not implement operators or cost-based optimization techniques comparable to the ones implemented by xAvalanche, preventing thus the generalization of the reported results to the rest of the existing federated query engines, e.g., HiBISCuS-FedX, SPLENDID, or ANAPSID.
Pros:
+ A query planner able to traverse the space of query plans that comprise sub-plans executable against several SPARQL endpoints.
+ Design of physical operators which allow for the execution of join and union operators in a distributed fashion.
+ A distributed caching system able to enhance performance of SPARQL endpoints.
+ Formal properties of the proposed approach are discussed in terms of time complexity analysis.
+ Empirical evaluation of the properties of the proposed approach using existing testbeds and one state-of-the-art engine.
Cons:
- Basic concepts are not formally stated.
- The problem of partition-aware union grouping is not defined, so it is not possible to determine the optimality conditions of this problem.
- Intractability of the problem of identifying efficient query execution plans in the studied space of plans is not analyzed. Similarly, complexity of evaluating plans with unions is not discussed.
- Optimality of the approach is claimed without any formalization of the solved optimization problem. Additionally, characteristics of the instance of the problem that ensure optimality of the proposed solution are not stated or demonstrated.
- The configuration of the experiment is not explained in detail, e.g., characteristics of the SPARQL endpoints, number of buffers assigned to the SPARQL endpoints, maximum memory use. Parameters that impact on the performance of the approach are not statistically demonstrated.
- Behavior of the proposed query optimizer tasks in presence of general vocabulary predicates such as rdf:type, owl:sameAs, is not analyzed.
- Reported metrics are not precisely defined, e.g., time for the total and first results.
Detailed comments:
Abstract and Introduction
The authors claim that xAvalanche is able to explore a search space unexplored so far, and do not make any reference to the plans produced by HiBISCuS-FedX, or ANAPSID. Both approaches are able to execute sub-queries against sub-sets of SPARQL endpoints. To support this statement, the explored search space have to be clearly defined, as well as the feasibility of assessing this goal in presence of queries with a large number of triple patterns, e.g., queries as the ones reported by Sahoo [Sahoo2010] and Loizou et al. [DBLP:journals/ws/LoizouAG15]. Additionally, the proposed approach should be compared to unfolding strategies implemented by Global As View integration frameworks, where the union of all the rewritings (query plans) corresponds to a plan of the original plan and global views correspond to entries in the plan matrix.
1.2 Contributions:
Optimality of the proposed approach is stated as one of the contributions of the paper, but no formal definition of the optimization problem is presented or proof of the optimality of the solution is reported. Unfortunately, this statement cannot be supported with the results reported in this current version of the paper.
The proposed physical operators are not defined, so there is no evidence, theoretical or empirical, that supports the statement: “…. the union operator scalable to hundred or thousand of endpoints…”.
The proposed caching strategy attempts to mitigate the high latency of the SPARQL endpoints; nevertheless, no experimental evaluation is reported to back this claim.
2.1 Related work
Adaptive query engines do not rely on statistics to adjust execution schedulers. So it seems inconsistent with the fact that Avalanche implements an inter-operator adaptive query execution strategy, while it relies on statistics during query optimization. This point needs to be clarified.
3. Optimization
Bushy plans have shown to be more suitable than left-linear plans for SPARQL queries. The selection of left-linear plans have to be justified, and the performance of the proposed approach empirically compared with federated engines that do implement bushy plans, i.e., SPLENDID or ANAPSID.
3.2
Definitions of the plan matrix and the plan matrix reduction have to be included. Also, the optimization problem has to be defined, as well as the conditions to be me by the instance of the problem to ensure optimality.
Complexity analysis of the proposed solutions is presented without any proof. Further, correctness and completeness of the proposed algorithms are not stated.
4. Scalable Distributed Unions
Completeness of the results produced by the union operator is claimed without any formal proof or empirical evaluation. Similarly, properties of the operator that allows for scaling up to large number of endpoints have to be backed up.
Eliminating duplicates from the intermediate results prevents xAvalanche to implement the semantics with SPARQL queries without the DISTINCT modifier.
Because, all the queries evaluated in the experimental study did not include the DISTINCT modifier and FedX does produce duplicates, the query results produced by both engines may differ, impacting this difference in the performance of FedX.
5. Distributed State Management and Caching
Explain in more detail the properties of the proxy and the caching strategies.
6. Evaluation
Explain in more detail the experiment configuration. Which are the hypotheses of this experimental study? Are these hypotheses validated or falsified?
Claims about optimality of the plans found by the proposed optimizer have to be backed up.
Because xAvalanche is empirically evaluated against a federated engine that implements optimization strategies incomparable to the ones provided by xAvalanche, the reported results cannot be generalized to the rest of the federated query engines. Engines as HiBICus-FedX, ANAPSID, or SPLENDID should be included in the study.
Questions to the Authors:
Q0) Definition of plan fragment is misleading and should be defined inductively. Can the subset of sources be empty? What is the shape of the plan? Is a plan fragment a logical or a physical plan? According to the authors (page 5) only left-linear plans are considered. So, is a plan fragment a left-linear plan?
Q1) What is the complexity of the problem of deciding if a reduced plan matrix is an optimal solution of the partition-aware union grouping? When a plan matrix has an optimal reduction?
Q2) Plans defined in Definition 3.2. correspond to the union of left-linear plans, so calling these plans fragmented bushy plans is misleading. Furthermore, the authors claim that the space of the fragmented bushy plans is smaller than the space of bushy plans. The authors should include expressions that count both types of query plans to support this statement.
Q4) The authors claim that fragment bushy plans are variant of bushy plans; note that also left-linear plans are variant of bushy plans. Formulas expressing the number of left-plans included in a fragment bushy plan should be included to understand feasibility of applying this approach in real-world queries with a large number of triple patterns.
Q4) Time complexity of the optimizer that explores the extended planning space is presented without defining the optimization problem that is solving. What are the properties of the algorithm that has this time complexity? Does this algorithm solve the partition-aware union grouping problem?
Q5) What is a master endpoint? Can the authors illustrate in the existing SPARQL endpoints which of them are master endpoints? The same questions for slave SPARQL endpoints.
Q6) Why parallel tree-unions are computed? How a fragmented bushy plan is rewritten into a parallel tree-union?
Q7) What are the heuristics that support the generation of parallel plans as the ones reported in Figure 6?
Q8) What are the features of a proxy? Could the authors provide examples of existing proxies on top of existing SPARQL endpoints?
Q9) What are the main caching techniques implemented by xAvalanche? What is a record (unit in the cache) of an RDF document?
Q10) The characteristics of the benchmark that impact on the behavior of xAvalanche are presented without any justification or proof. Can the authors provide statistical evidence that these parameters have a significant impact? For example, the number of triple predicates on general vocabulary predicates such as rdf:type, owl:sameAs, is not mentioned. So, what would be the impact of these types of predicates?
Q11) The authors claim that horizontal partitioning schemas are a natural fit for federations of RDF data. Provide examples of existing RDF data sets on the LOD cloud that meet this property.
Q12) What will be the impact on having vertical partitioning schemas or hybrid?
Q13) What are the characteristics of the evaluated queries, i.e., number of triple patterns, number of joins, and unions? Queries with the DISTINCT modifier need to be included in the testbed as well as queries with a large number of triple patterns. Please note that in real world scenarios queries may have more than 50 triple patterns [Sahoo2010, DBLP:journals/ws/LoizouAG15]. Please include queries with more than 10 triple patterns, e.g., use real-world queries as the ones reported by Loizu et al. [DBLP:journals/ws/LoizouAG15].
Q14) Please cite existing federated query engines that implement a serial execution of the union of fragments? The serial execution of the union of fragments corresponds to the worst scenario that the parallel execution may trivially overcome. Adaptive implementations of the union operator provided by ANAPSID will allow for a more fear comparison of the proposed union operator.
Q16) Reported metrics are not defined. Please, define total time, best time, as well as the methods used to compute these metrics.
Q17) Explain how the fragments were distributed along the SPARQL endpoints; report on the number of SPARQL endpoints available and how the endpoints were distributed in the 11 machines.
Q18) FedX is implemented in Java while xAvalanche is implemented in Python 2.7.8. Please, explain how the clocks used to measure the time execution were set up to ensure precise values.
Q19) Neither FedX nor xAvalanche implements adaptive inter- or intra-operator techniques that ensure the generation of answers in an incremental fashion. Could the author explain the characteristics of FedX and xAvalanche that justify the different behaviors exhibit by these two engines when the first and the last answer are produced. Consider that all the queries do not include the DISTINCT modifier and FedX produces duplicates; so, the number of results produced by FedX may be larger that the ones produced by xAvalance. Exactly the same comment applies for the experiments reported in Section 6.7.
Q20) Although the proposed optimization and execution techniques are supposed to be applied on queries with joins and unions, only queries with joins are evaluated in the experimental study. Please, include queries that mix both joins and unions in the evaluation.
Q21) Authors highlight that queries “finish well”; do they want to say that all the answers are produced? Can completeness of the proposed approach ensure? Please, report percentage of overlap between the answers produced by two engines. For queries without the DISTINCT modifier, report the overlap between the produced answers as well as the multiplicity of the duplicates answers coincides. Please, report on metrics to compare multi-sets to report on the completeness of the answers [DBLP:conf/sgai/KostersL07].
Q22) Justify the statement: “an optimal tradeoff is obtained when \phi”. Did the author compute all the plans and find that the one produced by xAvalanche optimizers (GRDY and DP) with fragmentation is optimal. If so, how many triple patterns have these queries? Can this property be met for all the LINEAR shape queries? What are the conditions that ensure optimality of the greedy and dynamic based solution? Is optimality ensure with respect to the estimated cost model or with respect to the real cost? Is monotonicity one of the conditions to be met by the cost model to ensure optimality of the solution? How correlations between the triple patterns will affect the behavior of the proposed approach? How does the proposed cost model compare to the cost model proposed by Naumman et al. [DBLP:conf/icde/NeumannM11]? Could characteristic sets be computed from the plan matrix?
Q23) Given that the proposed engine is not adaptive, please justify sthe differences between the time of the first results of the plans produced by GRDY and DP. How are the properties of these plans? Are these plans able to produce complete answers?
Q24) Explain the meaning of “flexible scheduling of resources”, do the authors mean to say that their approach is adaptive? If so, what are the characteristics of the federation and queries that benefit adaptivity (flexibility) of the proposed approach? Please, indicate the impact at inter- and intra-operator level.
Q25) What are the characteristics of the best cases for xAvalanche? Can these results be generalized, i.e., can the authors probe optimality of their proposed approach whenever cases that meet the characteristics of the best cases? Are the plans optimal with respect to the cost model or to respect to the real cost?
Q26) How would the proposed parallel multi-cast join compare to existing parallel joins, e.g., NUMA-awarw Hash Joins by Lang et al. [DBLP:conf/vldb/LangLA0K13].
Q26) The proposed cost model is quite simple. Please, report on the correlation between the estimates and the real cost values. Additionally, show the impact of the selectivity of the join and duplicates of union on the accuracy of the proposed cost model.
Q27) What will be the expected behavior of xAvalanche if data is provided by Triple Pattern Fragments[DBLP:conf/semweb/VerborghHMHVSCCMW14] instead that from SPARQL endpoints?
@inproceedings{DBLP:conf/sgai/KostersL07,
author = {Walter A. Kosters and
Jeroen F. J. Laros},
title = {Metrics for Mining Multisets},
booktitle = {Research and Development in Intelligent Systems XXIV, Proceedings
of AI-2007, the Twenty-seventh {SGAI} International Conference on
Innovative Techniques and Applications of Artificial Intelligence,
Cambridge, UK, December 2007},
pages = {293--303},
year = {2007}
}
@inproceedings{DBLP:conf/vldb/LangLA0K13,
author = {Harald Lang and
Viktor Leis and
Martina{-}Cezara Albutiu and
Thomas Neumann and
Alfons Kemper},
title = {Massively Parallel NUMA-Aware Hash Joins},
booktitle = {In Memory Data Management and Analysis - First and Second International
Workshops, {IMDM} 2013, Riva del Garda, Italy, August 26, 2013, {IMDM}
2014, Hongzhou, China, September 1, 2014, Revised Selected Papers},
pages = {3--14},
year = {2013}
}
@inproceedings{DBLP:conf/icde/NeumannM11,
author = {Thomas Neumann and
Guido Moerkotte},
title = {Characteristic sets: Accurate cardinality estimation for {RDF} queries
with multiple joins},
booktitle = {Proceedings of the 27th International Conference on Data Engineering,
{ICDE} 2011, April 11-16, 2011, Hannover, Germany},
pages = {984--994},
year = {2011}
}
[Sahoo] S. S. Sahoo. Semantic Provenance: Modeling, Querying, and Application in Scientific Discovery. PhD Thesis. 2010.
@article{DBLP:journals/ws/LoizouAG15,
author = {Antonis Loizou and
Renzo Angles and
Paul T. Groth},
title = {On the formulation of performant {SPARQL} queries},
journal = {J. Web Sem.},
volume = {31},
pages = {1--26},
year = {2015}
}
@inproceedings{DBLP:conf/semweb/VerborghHMHVSCCMW14,
author = {Ruben Verborgh and
Olaf Hartig and
Ben De Meester and
Gerald Haesendonck and
Laurens De Vocht and
Miel Vander Sande and
Richard Cyganiak and
Pieter Colpaert and
Erik Mannens and
Rik Van de Walle},
title = {Querying Datasets on the Web with High Availability},
booktitle = {The Semantic Web - {ISWC} 2014 - 13th International Semantic Web Conference,
Riva del Garda, Italy, October 19-23, 2014. Proceedings, Part {I}},
pages = {180--196},
year = {2014}
}
|