Review Comment:
(I sincerely apologize to the authors that it has taken me so long to produce my review of their manuscript. I had some other things on my mind recently.)
The manuscript introduces a sampling-based data profiling approach for RDF datasets. The resulting profiles are estimations of the characteristic sets of the dataset to be profiled, including estimated statistics about these characteristic sets. Additionally, the manuscript introduces an approach to use such profiles for query planning in a query federation engine.
The strengths and weaknesses of the manuscript in a nutshell:
S1: novel and original
S2: well carried out research work
S3: generally excellent presentation
W1: no evaluation related to the effort of producing the proposed types of dataset profiles
W2: unclear how the proposed dataset profiling approach is meant to be applied in a federation setting
W3: the description of the query planning approach lacks relevant information
The rest of this review elaborates more on each of these points.
S1: The contributions of the authors are original and novel. While it has already been shown that characteristic sets can be used to a great effect for SPARQL query planning, and there also exists an attempt to use them for query planning in a federation setting, the work presented in the manuscript is the first (to the best of my knowledge) that aims to reduce the effort of extracting characteristic sets and related statistics by means of sampling and estimation. Additionally, the presented approach to use the resulting dataset profiles for query planning in a query federation engine is another novelty. The extensive evaluation clearly demonstrates the suitability of both the proposed dataset profiling approach and the corresponding approach to query planning, and it provides detailed insights into relevant properties of these approaches. Hence, the manuscript presents a significant advancement of the state of the art.
S2: The research has been carried out thoroughly. The presented dataset profiling approach has been defined formally in an elegant and easy-to-understand manner. All relevant aspects of the approach have been considered (namely, the different possible options for sampling, the functions for estimating the statistics for the dataset from the statistics of the sample, and the different possible similarity measures to quantify the accuracy of the estimations), and the corresponding decisions and their rationale are well founded, as is clearly documented in the manuscript. The evaluations of the two approaches (dataset profiling and query planning) are well designed, based on clear and relevant questions. The discussion of the experiments and the observations is thorough and drills into all important cases at great detail. As a consequence, the experimental results provide relevant and extensive insights both about the approaches in general and about their different configurations/options.
S3: It was an absolute pleasure to read this manuscript. The English is excellent and the text is well organized, with a structure that is logical and easy to follow. The authors clearly have put a lot of effort into presenting their approach and their results in an understandable and easy-to-digest form. Particularly, I enjoyed the concise and flawless formal exposure of the dataset profiling approach and the presentation of the experimental results in charts and tables that are both detailed and well layed out.
Despite all the praise, there are also some things that I would like the authors to improve:
W1: Given that the motivation of the proposed dataset profiling approach is to reduce the effort of extracting characteristic sets and related statistics, I would have expected that the evaluation comes back to this motivation, which is not the case. In my opinion, it is important that the manuscript shows that creating an estimated CSPF instead of a complete/accurate CSPF will indeed be less effort. A possible metric to show such a reduction of effort may be the time it requires to create the CSPF.
W2: While I can see how the proposed dataset profiling approach can be applied if one has the dataset(s) available, it is not entirely clear to me how the approach is meant to be applied in a federation setting where we cannot assume that the datasets are available completely to the query engine. The manuscript does not address this question at all. I am expecting a discussion of this question. Even if the idea is that the providers of the endpoints have to create and publish the profile of their dataset, this assumption needs to be stated.
W3: The description of the query planning approach in Section 4 lacks relevant information. While some of the things that I find missing are mentioned later in the manuscript and others can be figured by having knowledge of SPARQL federation engines and reading between the lines, I would like this section to be more explicit about the following aspects of the approach.
i) The plans created by the approach are not guaranteed to produce complete query results. I am okay with that and I notice that the manuscript mentions this later in the evaluation. However, since this is an important limitation of the approach, it must be mentioned already in the context of introducing the approach. In fact, I think this limitation should not only be mentioned but it should also be explained why the plans may not produce complete query results. The discussion later in the manuscript makes it sound as if the only reason for potentially incomplete results is the fact that the CSPF is just an estimation. However, that's not the only reason. Even with a complete CSPF, there are cases in which the approach may create a plan that may produce an incomplete query result. The reason for that is the greedy nature of Algorithm 1: As an example, consider a case in which |SSP'|>1 and there are at least three endpoints, ep1, ep2, and ep3, with the following properties. Endpoint ep1 has a characteristic set S1 such that preds(SSP') is in S1 and, thus, ep1 is added to R' in line 11. Next, ep2 has a characteristic set S2 and ep3 has a characteristic set S3 such that preds(B2) is in S2 and preds(B3) is in S3 where B2 and B3 are nonempty proper subsets of SSP' and the union of B2 and B3 is SSP' (i.e., SSP' can be partitioned into these two smaller subject-based star shaped BGPs). Notice that the result of B2 over ep2 may be joined with the result of B3 over ep3, which may produce a nonempty join result that may be needed for the overall query result to be complete. However, after having added ep1 to R' (see above), the algorithm does not add (B2,ep2) and (B3,ep3) to the query decomposition that it produces (that is, in particular, due to lines 15-16) and, thus, the result of B2 over ep2 and of B3 over ep3 will not be fetched by any query plan created based on the decomposition.
ii) In the same context, it is not clear whether query results are meant to be sets or multisets. If they are meant to be multisets, then the approach used for the answer completeness metric (page 21) is incorrect because, by creating and querying a union of all graphs, potential duplicates of solutions may not be possible to produce. On the other hand, if query results are meant to be sets, it needs to be emphasized that (and how) duplicates are removed by the query engine.
iii) Before discussing the cardinality estimation and the join ordering approach, there should be a brief description of the query plans in general and how they are created from the query decomposition. I assume that each (SE_i,R_i) in D becomes a UNION, and all these unions for the different (SE_i,R_i) are then joined.
iv) The description of the cardinality estimation for some of the types of triple patterns (pages 12-13) is not entirely clear.
- For Type I, why does the estimation consider all predicates and why does it consider all characteristic sets? This does not seem to make sense, given that both the subject and the predicate of these triple patterns are constants.
- For Type II, the meaning of the term inside the sum of the formula is not clear. How is that defined?
v) The definition of the notion of a query decomposition (in the text on the top-right of page 11) seems somewhat incomplete. I suspect there are a few more conditions such as: all the SE_i are pairwise disjoint and the union of all the SE_i is P. These conditions should be mentioned here and, in this context, it would also be helpful if you elaborate a bit on what "relevant" means when talking about the data sources R_i.
Some minor things:
* In the formula of Definition 2.3 (page 4), what does 'max' range over?
* In the example on the top left of page 6, the value of count(S_1) is slightly different from the one in the table on page 2. Is this just a typo or is there a deeper reason for it?
* In the beginning of Section 4 (second paragraph) you say that the Odyssey "query planner assumes complete information." I would appreciate if you add one more sentence that elaborates a bit more on this statement. In what sense are complete characteristic sets necessary for the Odyssey approach?
* In the formula for card(SE_i,R_i), close to the top-right of page 12, below the sum symbol it should be ep instead of e.
|