Characteristic Sets Profile Features: Estimation and Application to SPARQL Query Planning

Tracking #: 2714-3928

Authors: 
Lars Heling
Maribel Acosta

Responsible editor: 
Guest Editors ESWC 2020

Submission type: 
Full Paper
Abstract: 
RDF dataset profiling is the task of extracting a formal representation of a dataset's features. Such features may cover various aspects of the RDF dataset, ranging from information on licensing and provenance to statistical descriptors of the data distribution and its semantics. In this work, we focus on the characteristic sets profile features that capture both structural and semantic information of an RDF dataset, making them a valuable resource for different downstream applications. While previous research demonstrated the benefits of characteristic sets in centralized and federated query processing, access to these fine-grained statistics is taken for granted. However, especially in federated query processing, computing this profile feature is challenging, as it can be difficult and/or costly to access and process the entire data from all federation members. We address this shortcoming by introducing the concept of a profile feature estimation and propose a sampling-based approach to generate estimations for the characteristic sets profile feature. In addition, we showcase the applicability of these feature estimations in federated querying by proposing a query planning approach that is specifically designed to leverage these feature estimations. In our first experimental study, we intrinsically evaluate our approach on the representativeness of the feature estimation. The results show that even small samples of just 0.5% of the original graph's entities allow for estimating both structural and statistical properties of the characteristic sets profile features. Our second experimental study extrinsically evaluates the estimations by investigating their applicability in our query planner using the well-known FedBench benchmark. The results of the experiments show that the estimated profile features allow for obtaining efficient query plans.

Decision/Status: 
Major Revision

Solicited Reviews:
Review #1
By Ruben Taelman submitted on 19/Feb/2021
Suggestion:
Minor Revision
Review Comment:

This article is an extension of "Estimating Characteristic Sets for RDF Dataset Profiles Based on Sampling" that was presented at ESWC 2020.
It builds upon the introduction of a sampling-based approach for the creation of characteristic sets profile features.
The motivation of this sampling-based approach is that datasets may be too large (and/or volatile) to (re)create characteristic sets based on the full dataset,
while just selecting smaller subsets may be sufficient.
In order to measure the effectiveness of the sampling-based characteristic sets profile features vs non-sampling-based characteristic sets profile features,
the authors introduce metrics and evaluate them over several real-world datasets.
Furthermore, the approach is incorporated into a new query planner for federated query execution, which is evaluated using FedBench.

This work is a significant extension of the authors' previous work.
The authors make good on their stated intention to investigate their approach in query planning for federated querying,
and show that it can indeed lead to improved query performance.
Furthermore, six more datasets were analyzed, and a more in-depth discussion of the results is provided.

The text is very well written, all figures are clear, experiments are reproducible, and discussions are (very) extensive.
One could argue that some parts (e.g. 5.1) may be a bit *too* extensive, to the point that
they provide many details that can easily be derived from the figures and tables.
Nevertheless, I consider this work to be very polished in its current state,
and only have a few minor comments before recommending an acceptance to the Semantic Web Journal.

The only real concern I have with this work is regarding its motivation.
The authors call for the need of a sampling-based approach for CSPFs because "it can be difficult and/or costly to access the entire RDF datasets of all members in the federation".
However, there are elements in the approach that would in fact require access to the entire dataset in order to work.
For example, the weighted sampling approach requires computing the out-degree of every entity to determine the probability of that entity being sampled (see the sketch below).
Another example is the statistics-enhanced projection functions, which are based on high-level information about the dataset
and would therefore require access to the entire RDF dataset.
I believe this concern can be resolved by for example adding an additional paragraph in the introduction.
For instance, by adding a clear use case where such needs are met.
Coming back to this use case near the end of the article could also make the conclusion section stronger.
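To make the first example concrete, consider the following minimal sketch of out-degree-weighted entity sampling (my own hypothetical code, not the authors' implementation; the entity and sampling details are assumptions). Note that the Counter pass over all triples is precisely the full-dataset access the motivation claims to avoid:

    import random
    from collections import Counter

    def weighted_entity_sample(triples, k, seed=0):
        # Weighted sampling sketch: pick k subjects with probability
        # proportional to their out-degree. Computing the out-degrees
        # already requires a full pass over every triple.
        out_degree = Counter(s for s, _, _ in triples)
        subjects = list(out_degree)
        weights = [out_degree[s] for s in subjects]
        # random.choices samples with replacement; a real sampler might
        # instead draw entities without replacement.
        return random.Random(seed).choices(subjects, weights=weights, k=k)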

Minor comments:
* Page 7 (3.3): "While iterating the triples, we determine the characteristic set for each subject in H"
How is this done exactly? I understand this is explained in [21], but for the self-containedness of this article, it would be good to make this more concrete, perhaps with some pseudo-code (a possible sketch follows these minor comments).
* Page 20 (5.2): "In particular, we only considered the weighted and unweighted sampling methods"
Why not the hybrid method? Not claiming it should be done, but there must have been a reason to *not* do it.
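Regarding the first comment above: for concreteness, the core of such pseudo-code could be as simple as the following sketch (my own hypothetical code, not taken from [21]): one pass that groups the predicates of each subject, followed by counting how many subjects share each predicate set.

    from collections import defaultdict

    def characteristic_sets(triples):
        # Group the predicates of each subject in one pass over (s, p, o).
        preds_by_subject = defaultdict(set)
        for s, p, _ in triples:
            preds_by_subject[s].add(p)
        # count(S): number of subjects sharing the characteristic set S.
        counts = defaultdict(int)
        for preds in preds_by_subject.values():
            counts[frozenset(preds)] += 1
        return counts

    triples = [("a", "name", "Ann"), ("a", "knows", "b"),
               ("b", "name", "Ben"), ("b", "knows", "a"),
               ("c", "name", "Cal")]
    print(dict(characteristic_sets(triples)))
    # e.g. {frozenset({'name', 'knows'}): 2, frozenset({'name'}): 1}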

Review #2
By Olaf Hartig submitted on 28/May/2021
Suggestion:
Major Revision
Review Comment:

(I sincerely apologize to the authors that it has taken me so long to produce my review of their manuscript. I had some other things on my mind recently.)

The manuscript introduces a sampling-based data profiling approach for RDF datasets. The resulting profiles are estimations of the characteristic sets of the dataset to be profiled, including estimated statistics about these characteristic sets. Additionally, the manuscript introduces an approach to use such profiles for query planning in a query federation engine.

The strengths and weaknesses of the manuscript in a nutshell:

S1: novel and original
S2: well carried out research work
S3: generally excellent presentation

W1: no evaluation related to the effort of producing the proposed types of dataset profiles
W2: unclear how the proposed dataset profiling approach is meant to be applied in a federation setting
W3: the description of the query planning approach lacks relevant information

The rest of this review elaborates more on each of these points.

S1: The contributions of the authors are original and novel. While it has already been shown that characteristic sets can be used to a great effect for SPARQL query planning, and there also exists an attempt to use them for query planning in a federation setting, the work presented in the manuscript is the first (to the best of my knowledge) that aims to reduce the effort of extracting characteristic sets and related statistics by means of sampling and estimation. Additionally, the presented approach to use the resulting dataset profiles for query planning in a query federation engine is another novelty. The extensive evaluation clearly demonstrates the suitability of both the proposed dataset profiling approach and the corresponding approach to query planning, and it provides detailed insights into relevant properties of these approaches. Hence, the manuscript presents a significant advancement of the state of the art.

S2: The research has been carried out thoroughly. The presented dataset profiling approach has been defined formally in an elegant and easy-to-understand manner. All relevant aspects of the approach have been considered (namely, the different possible options for sampling, the functions for estimating the statistics for the dataset from the statistics of the sample, and the different possible similarity measures to quantify the accuracy of the estimations), and the corresponding decisions and their rationale are well founded, as is clearly documented in the manuscript. The evaluations of the two approaches (dataset profiling and query planning) are well designed, based on clear and relevant questions. The discussion of the experiments and the observations is thorough and drills into all important cases at great detail. As a consequence, the experimental results provide relevant and extensive insights both about the approaches in general and about their different configurations/options.

S3: It was an absolute pleasure to read this manuscript. The English is excellent and the text is well organized, with a structure that is logical and easy to follow. The authors clearly have put a lot of effort into presenting their approach and their results in an understandable and easy-to-digest form. Particularly, I enjoyed the concise and flawless formal exposition of the dataset profiling approach and the presentation of the experimental results in charts and tables that are both detailed and well laid out.

Despite all the praise, there are also some things that I would like the authors to improve:

W1: Given that the motivation of the proposed dataset profiling approach is to reduce the effort of extracting characteristic sets and related statistics, I would have expected the evaluation to come back to this motivation, which is not the case. In my opinion, it is important that the manuscript shows that creating an estimated CSPF instead of a complete/accurate CSPF will indeed require less effort. A possible metric to show such a reduction of effort may be the time it takes to create the CSPF.

W2: While I can see how the proposed dataset profiling approach can be applied if one has the dataset(s) available, it is not entirely clear to me how the approach is meant to be applied in a federation setting where we cannot assume that the datasets are available completely to the query engine. The manuscript does not address this question at all. I am expecting a discussion of this question. Even if the idea is that the providers of the endpoints have to create and publish the profile of their dataset, this assumption needs to be stated.

W3: The description of the query planning approach in Section 4 lacks relevant information. While some of the things that I find missing are mentioned later in the manuscript and others can be figured out by having knowledge of SPARQL federation engines and reading between the lines, I would like this section to be more explicit about the following aspects of the approach.

i) The plans created by the approach are not guaranteed to produce complete query results. I am okay with that and I notice that the manuscript mentions this later in the evaluation. However, since this is an important limitation of the approach, it must be mentioned already in the context of introducing the approach. In fact, I think this limitation should not only be mentioned but it should also be explained why the plans may not produce complete query results. The discussion later in the manuscript makes it sound as if the only reason for potentially incomplete results is the fact that the CSPF is just an estimation. However, that's not the only reason. Even with a complete CSPF, there are cases in which the approach may create a plan that may produce an incomplete query result. The reason for that is the greedy nature of Algorithm 1: As an example, consider a case in which |SSP'|>1 and there are at least three endpoints, ep1, ep2, and ep3, with the following properties. Endpoint ep1 has a characteristic set S1 such that preds(SSP') is in S1 and, thus, ep1 is added to R' in line 11. Next, ep2 has a characteristic set S2 and ep3 has a characteristic set S3 such that preds(B2) is in S2 and preds(B3) is in S3 where B2 and B3 are nonempty proper subsets of SSP' and the union of B2 and B3 is SSP' (i.e., SSP' can be partitioned into these two smaller subject-based star shaped BGPs). Notice that the result of B2 over ep2 may be joined with the result of B3 over ep3, which may produce a nonempty join result that may be needed for the overall query result to be complete. However, after having added ep1 to R' (see above), the algorithm does not add (B2,ep2) and (B3,ep3) to the query decomposition that it produces (that is, in particular, due to lines 15-16) and, thus, the result of B2 over ep2 and of B3 over ep3 will not be fetched by any query plan created based on the decomposition.
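To make this scenario easier to check, here is a toy reconstruction in code (my own sketch of the greedy behavior under assumed semantics of Algorithm 1, not the algorithm itself):

    # preds(SSP') and the characteristic sets of three endpoints.
    ssp_preds = {"p1", "p2"}
    char_sets = {"ep1": [{"p1", "p2"}],   # covers SSP' entirely
                 "ep2": [{"p1"}],         # covers only B2
                 "ep3": [{"p2"}]}         # covers only B3

    # Greedy step: if some endpoint covers all of preds(SSP'), keep only
    # the full star; the B2/B3 split is never explored.
    full_cover = [ep for ep, css in char_sets.items()
                  if any(ssp_preds <= cs for cs in css)]
    if full_cover:
        decomposition = [(frozenset(ssp_preds), full_cover)]
    else:
        decomposition = [(frozenset({"p1"}), ["ep2"]),
                         (frozenset({"p2"}), ["ep3"])]

    print(decomposition)  # [(frozenset({'p1', 'p2'}), ['ep1'])]
    # The join of B2 over ep2 with B3 over ep3 is never fetched.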

ii) In the same context, it is not clear whether query results are meant to be sets or multisets. If they are meant to be multisets, then the approach used for the answer completeness metric (page 21) is incorrect because, by creating and querying a union of all graphs, potential duplicates of solutions may not be possible to produce. On the other hand, if query results are meant to be sets, it needs to be emphasized that (and how) duplicates are removed by the query engine.
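A minimal illustration of why the union-based baseline cannot produce duplicates (my own sketch; the endpoint contents are made up):

    # Both endpoints contain the same triple; under bag semantics the
    # federation may return the solution twice, but the union of the
    # graphs contains the triple once, so the baseline returns it once.
    ep1 = {("alice", "knows", "bob")}
    ep2 = {("alice", "knows", "bob")}

    federated_bag = [(s, o) for g in (ep1, ep2)
                     for (s, p, o) in g if p == "knows"]
    baseline_bag = [(s, o) for (s, p, o) in ep1 | ep2 if p == "knows"]
    print(len(federated_bag), len(baseline_bag))  # 2 1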

iii) Before discussing the cardinality estimation and the join ordering approach, there should be a brief description of the query plans in general and how they are created from the query decomposition. I assume that each (SE_i,R_i) in D becomes a UNION, and all these unions for the different (SE_i,R_i) are then joined.
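Under that assumed reading, the plan construction might look like the following sketch (hypothetical; the SERVICE rendering and endpoint URLs are my assumptions, not something confirmed by the paper):

    def plan_sketch(decomposition):
        # Each (SE_i, R_i) becomes a UNION of SERVICE calls over its
        # relevant sources; the union blocks are joined in one group.
        blocks = []
        for subexpr, sources in decomposition:
            branches = ["{ SERVICE <%s> { %s } }" % (ep, subexpr)
                        for ep in sources]
            blocks.append("{ " + " UNION ".join(branches) + " }")
        return "SELECT * WHERE {\n  " + "\n  ".join(blocks) + "\n}"

    d = [("?x <p1> ?y .", ["http://ep1/sparql", "http://ep2/sparql"]),
         ("?y <p2> ?z .", ["http://ep3/sparql"])]
    print(plan_sketch(d))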

iv) The description of the cardinality estimation for some of the types of triple patterns (pages 12-13) is not entirely clear.
- For Type I, why does the estimation consider all predicates and why does it consider all characteristic sets? This does not seem to make sense, given that both the subject and the predicate of these triple patterns are constants.
- For Type II, the meaning of the term inside the sum of the formula is not clear. How is that defined?

v) The definition of the notion of a query decomposition (in the text on the top-right of page 11) seems somewhat incomplete. I suspect there are a few more conditions such as: all the SE_i are pairwise disjoint and the union of all the SE_i is P. These conditions should be mentioned here and, in this context, it would also be helpful if you elaborate a bit on what "relevant" means when talking about the data sources R_i.

Some minor things:

* In the formula of Definition 2.3 (page 4), what does 'max' range over?

* In the example on the top left of page 6, the value of count(S_1) is slightly different from the one in the table on page 2. Is this just a typo or is there a deeper reason for it?

* In the beginning of Section 4 (second paragraph) you say that the Odyssey "query planner assumes complete information." I would appreciate if you add one more sentence that elaborates a bit more on this statement. In what sense are complete characteristic sets necessary for the Odyssey approach?

* In the formula for card(SE_i,R_i), close to the top-right of page 12, below the sum symbol it should be ep instead of e.

Review #3
By Juan Reutter submitted on 07/Jul/2021
Suggestion:
Minor Revision
Review Comment:

The paper provides two important contributions.

First is the development of characteristic sets profiles. From the definition of characteristic sets, the paper proceeds to carefully explain what sort of profiles can be used for dealing with characteristic sets, and then gives a variety of ways of approximating these measures via sampling.
Next is the application of this framework to the specific problem of estimating join orderings in federated queries. Here we don't see that much development from the starting point of the original characteristic sets paper, although the algorithm and some border cases need to be reconsidered to adapt to sampling. However, as the experiments show, the impact of sample-based profiling on this problem can be huge.

This paper offers an interesting and fresh discussion of profiling for RDF databases, and reminds us that there is still much to do in terms of making federated queries work. The community will benefit from publishing this paper, and I have no doubt that future research will build on it.

But I also think there is a lot of room for improving this paper, and there are some key issues that I would like the authors to address. I ask these from the conviction that these changes will improve the reading and impact of the paper.

- Sections 3.2, 3.3 and 3.4 must address the "why these and not others?" question. Why are you suggesting sampling as you describe it, and not, for example, sampling using graph traversal approaches? Is it because graph traversal is slower? Is it open whether there are better measures? One could ask similar questions for the projection functions (although here no alternative comes to mind so quickly), and also for the similarity measures (you could, for example, try fitting the distribution of predicates and then do something with that).
This is a very good opportunity to explain the rationale behind your choices, what applications or past experiences you had in mind when proposing them, and what you would consider interesting approaches for future work.

- The formal writing in Section 4 is much sloppier. If you are only going to consider conjunctive patterns, then define those? You could have a look at Hogan, Riveros, Rojas and Soto from ISWC 2019; it contains a very clean formalization of SPARQL conjunctive patterns. Then, Algorithm 1 needs a definition of what a star-shaped query is. It needs a clear statement of the intended output: SSPs that are coupled together with relevant sources, that q is the size of the SSPs, and a few other minor details. An unrelated question: surely you can optimize the part where you build D by iterating over all possible SSPs of all possible sizes, first by storing a "visited" list so that you don't query what you already have, and then by reasoning about cardinalities you already have: if (x,p1,y) has cardinality N and (x,p2,z) has cardinality M, then you know [(x,p1,y) and (x,p2,z)] should be lower than or equal to the minimum of M and N, and this may be enough for your purposes. Have you thought about these heuristics?
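Regarding the second heuristic: min(N, M) is a safe upper bound on the number of distinct subjects in the star (each matching subject must appear in both patterns), which may already be enough for pruning. A tiny sketch of what I mean (my own hypothetical names):

    def star_subject_upper_bound(*pattern_cards):
        # The number of distinct subjects matching every pattern of a
        # subject star cannot exceed the number matching any single one.
        return min(pattern_cards)

    # (x,p1,y) matches N=1000 triples, (x,p2,z) matches M=40 triples:
    print(star_subject_upper_bound(1000, 40))  # at most 40 distinct subjects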

- I was a bit confused about the goal of the first round of experiments. I understand the questions you are asking (Q1-Q4), but why do you ask these questions in the first place? Why do you want to check all of this? To prove my point, if I read Q1 I would answer "a bigger sample size always produces better samples". But perhaps this is not so obvious, or perhaps you are looking to come up with optimal sample sizes, or I don't know…

Other specific comments:

- In lines 44+ of the second page, you mention that a limitation of full characteristic sets is "that the computational complexity to get the characteristic sets for an RDF dataset with n triples is in O(n · log(n) + n)". Do you actually have a matching lower bound? Because if not, I would not say that the complexity is a limitation, but rather that we simply do not know how to compute them fast enough - and that is a perfectly fine limitation. Related to that question: has anyone studied the problem of maintaining characteristic sets? At first glance I would say they are quite simple to maintain under insertions/deletions, so that would be another approach apart from sampling.
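To sketch what I mean by simple maintenance (my own hypothetical sketch, not an existing system): keep the predicate set per subject plus a counter per characteristic set, and move a subject between sets when an insertion changes its predicate set. Deletions are symmetric, but additionally need per-(s,p) triple counts to know when a predicate truly disappears from a subject.

    from collections import defaultdict

    class CharSetIndex:
        def __init__(self):
            self.preds = defaultdict(set)   # subject -> its predicate set
            self.counts = defaultdict(int)  # frozenset of preds -> count(S)

        def insert(self, s, p):
            old = frozenset(self.preds[s])
            if p in old:
                return                      # characteristic set unchanged
            if old:                         # move s out of its old set
                self.counts[old] -= 1
                if self.counts[old] == 0:
                    del self.counts[old]
            self.preds[s].add(p)
            self.counts[frozenset(self.preds[s])] += 1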

- Section 5.1.3: In the conclusions after Q1. Again, why is this not obvious? And regarding Q4: larger sample sizes involve slower computation times. Perhaps you could try to reason about this tradeoff, given the times taken to compute the statistics in your experiments.

- Table 4: baseline 1000%?

- Table 4 (and experiments in general): after reading this, I think I really need to see what happens with 7.5%. Or more generally: at some point the larger execution times should kick in and you should start seeing longer query times. Would you be able to comment on the sample size at which this starts to happen?

- page 25, line 3: you are missing a word somewhere I think.