Flexible Query Processing for SPARQL

Tracking #: 1097-2309

Authors: 
Riccardo Frosini
Andrea Calì
Alexandra Poulovassilis
Peter T. Wood

Responsible editor: 
Guest Editors Question Answering Linked Data

Submission type: 
Full Paper
Abstract: 
Flexible querying techniques can enhance users' access to complex, heterogeneous datasets in settings such as Linked Data, where the user may not always know how a query should be formulated in order to retrieve the desired answers. This paper presents query processing algorithms for a fragment of SPARQL 1.1 incorporating regular path queries (property path queries), extended with query approximation and relaxation operators. Our flexible query processing approach is based on query rewriting and returns answers incrementally according to their "distance" from the exact form of the query. We formally show the soundness, completeness and termination properties of our query rewriting algorithm. We also present empirical results that show promising query processing performance for the extended language.
Full PDF Version: 
Tags: 
Reviewed

Decision/Status: 
Minor Revision

Solicited Reviews:
Click to Expand/Collapse
Review #1
By Michael Schmidt submitted on 11/Jul/2015
Suggestion:
Accept
Review Comment:

The bottom line of my previous review was that this paper is a very interesting and relevant contribution, but would require some revisions and clarifications. Having looked at the revised version and the author feedback, I confirm that the new version submitted by the authors addresses these points, in particular:

- Approaches to increase the transparency of relaxed query results are now discussed in section 4.4, including interesting ideas such as the specification of related predicate groups (by the user), output of queries that lead to results, etc.
- The issue of extensibility to SPARQL fragments beyond conjunctive queries has been addressed. In particular, the fragment has been extended to UNION and possible extensions to OPTIONAL is now discussed in the paper.
- The experimental evaluation has been reworked, addressing various comments made in the previous review. Most notably, the experiments now include results for a novel optimization techniques, which is based on the idea of caching query fragments. This strategy (as well as the results) look very promising, as they avoid executing similar patterns over and over again — to a certain degree, this approach also addresses the issue of permutation at query rather than at triple pattern level, which I mentioned in my previous review. Two minor remarks:
-- If the optimisation does not help for single conjunct queries in the general, why not simply disabling it for such cases (I think it is ok to show the effect in the experiments, but you may want to mention in a footnote that it will be disabled in the target system)?
-- Could you explain what data structure you are using for the cache? Is this a (more or less dump) list that you load and iterate over? Or are you using in-memory hash sets that you could reuse to perform the final join efficiently? Also, if I understand right, the two parts of there query are evaluated independently and joined together in the end. It may be possible to “inject” variable bindings from the cached result into the evaluation of the dynamic part, to speed up its evaluation.
-- In any case, I feel there would be still room for improvement through more advanced techniques. I’d suggest to include a very brief discussion in the Related Work section, if space permits.
- Various minor improvements have been made.

Given that the comments in the review have been fully addresses I vote for publication of the revised version.

Review #2
By Aidan Hogan submitted on 28/Jul/2015
Suggestion:
Accept
Review Comment:

While I still have some concerns about the practicality of the system (as expressed in the previous review), I appreciate that the authors have added some discussion on these aspects, introducing some scenarios elaborating how it could be used, more in-depth treatment of strengths and weaknesses, as well as some discussion on optimisations. In general, I am quite happy with how the authors have addressed the comments and I believe that pushing the formal proofs to the Appendix was likewise a good choice in that the paper reads a lot better now. I still have some comments I'd encourage the authors to look at, but I don't see anything significant enough on my side to warrant holding the paper up with another revision cycle. Hence I recommend an accept.

I just have one technical doubt that the authors may wish to look at clarifying, regarding again the notion of an "extended reduction". When introduced, the authors state that "the extended reduction of an ontology K is required". The use of the definite article here suggests somehow that there is a unique extended reduction, which I don't immediately buy. I mentioned it in the previous review and although the text is now clearer, I still have questions over the uniqueness of extended reductions. In Example 3, the authors provide a case where a cyclical ontology leads to such a problem, and thus restrict to acyclical inputs. But I am still not convinced that these extended reductions are unique in acyclical cases, and what it would mean for the rest of the proposals if they are not. Take for example rule (e1) in the paper and the following K:

(x,dom,c) [F1]
(y,sp,x) [F2]
(z,sp,y) [F3]

This K is acyclical. Just considering (e1), we get in cl(K):

(x,dom,c) [F1]
(y,sp,x) [F2]
(z,sp,y) [F3]
(y,dom,c) [C1]
(z,dom,c) [C2]

Now let's consider applying (e1) in reverse.

1) If we apply it in reverse to (F1+F2->C1), and remove C1 first, then we cannot apply (e1) in reverse any more and we've reached a fixpoint at {F1,F2,F3,C2}.

2) If we apply it in reverse to (C1+F3->C2), and remove C2 first, we can still apply step 1) above, so we will reach the fixpoint of {F1,F2,F3}.

Probably this just requires a little more discussion to make clear. Specifically, in the left hand column of page 6, I would like the authors to make clear: is this extended reduction unique for acyclical cases, if so why, and if not, what effect does it have on the later relaxation/approximation steps.

MINOR COMMENTS:
* Page 1, right column, bad box with "(which"
* "arising form" -> "arising from"
* Page 6, right column, RELAX semantics, would be worth fixing that widow for the equation
* Table captions should go on top? Double check the SWJ style.
* "only on triple patterns containing a regular expression in which at least one constant appears" At first I read this as the regular expression must contain the constant. I would rather say triple patterns where the subject or object is constant.
* "We ran each query 6 times ..." I missed this the first time around but dropping the first run might be problematic in that the engine could (at least in theory) cache the query against the result, where the latter query runs would then just be retrieving the answers from the cache, rather than executing them. It would be worth mentioning why this is not problematic: i.e., does Jena TDB cache queries against answers? (This might be a more significant issue if it weren't for the fact that many of the queries are quite slow ... if they were all returning in milliseconds, I would suspect caching.)

Review #3
Anonymous submitted on 03/Aug/2015
Suggestion:
Accept
Review Comment:

This is a revision of the manuscript, extending the conference paper [5] and building on the prior work [13] by two co-authors of the present submission. In this revision, a reasonable effort has been made by the authors to address the reviewers' comments. In particular, an optimization strategy for query evaluation, using incremental computation with caching, has been implemented and described in Section 5. The paper presents interesting, sound and thoroughly elaborated ideas, and thus can be recommended for publication.