SpeCS - An Efficient and Comprehensive SPARQL Query Containment Solver

Tracking #: 2667-3881

Mirko Spasić
Milena Vujošević Janičić

Responsible editor: 
Guest Editors Web of Data 2020

Submission type: 
Full Paper
Abstract:
The query containment problem is a fundamental computer science problem, originally defined for relational queries. With the growing popularity of the SPARQL query language, it became relevant and important in this new context: reliable and efficient SPARQL query containment solvers may have various applications within static analysis of queries, especially in the area of query optimization. In this paper, we present a new approach for solving the query containment problem in SPARQL. It is based on reducing the query containment problem to the satisfiability problem in first-order logic. It covers a wide range of SPARQL language constructs, including unions of conjunctive queries, blank nodes, projections, subqueries, and the from, filter, optional, and graph clauses. It also covers containment under the RDF Schema entailment regime, and it can deal with the subsumption relation. We describe an implementation of the approach, the open-source solver SpeCS, and its thorough experimental evaluation on the relevant benchmarks Query Containment Benchmark and SQCFramework. The evaluation shows that SpeCS is highly efficient and that, compared to state-of-the-art solvers, it gives more precise results in a shorter amount of time. In addition, SpeCS has the highest coverage of supported language constructs.

Solicited Reviews:
Review #1
By Jérôme Euzenat submitted on 08/Mar/2021
Major Revision
Review Comment:

The paper deals with SPARQL query containment. The problem is very important and relevant for various semantic web activities, including but not restricted to querying.

The contributions of this paper are, from my standpoint:
- a query-containment prover based on first order logic;
- discussion and clean-up of benchmarks.

The first contribution is original in the sense that other solutions attempted to avoid full first-order logic and have a different, often reduced, coverage with respect to the present work. Moreover, the proposed prover, thanks to the use of state-of-the-art reasoners, seems to perform efficiently.

There are several points which, from my standpoint, prevent the paper from being published in its current state:
- The structure of the paper is quite unbalanced: there is too much known content and too little original content.
- There are some weird problems with the content.
- The main theorems are missing: this is certainly the most serious issue.
- The numerous English language mistakes make the paper difficult to read.
I return to these issues one by one.

(1) Structure

A lot of the content before Section 7 (p15) is not new. This is not well balanced. It is clear that the manipulated concepts need to be introduced; however, this is too much already-published material. Moreover, the presentation provides the basics, with various forward references to extensions (Section 5). I think it would be good to (a) simplify the description of the SPARQL semantics (with references to other sources) and (b) directly cover the whole fragment used in this work, to avoid forward references and repetitions. For instance, the figures presenting the grammar in Section 5 occupy a lot of space yet contain only one line's worth of information. The discussion on normalisation in 5.1 would benefit from referring to standard work on normalisation in logic and query manipulation.

Another structural problem is that it would be natural to discuss query containment (4.2) and query subsumption (6) together. It is not clear why this was not done.

The numbering of items is non-standard: please ask yourself why it usually does not happen in a paper that two theorems have the same number. This is not acceptable.

(2) Various weird aspects

At the end of page 5, it is written: 'We follow the standard set semantics...'. As far as I know, but I may be wrong, the standard SPARQL semantics is the bag semantics, not the set semantics. Hence, the sentence is misleading for those who do not know. There is no shame in writing: 'Despite the SPARQL semantics being a bag semantics, we follow the standard approach to query containment of using the set semantics \cite{those-who-made-it-standard}.'

It is surprising that the theory does not directly take variable renaming into account, yet in the last section (p17) it is added to the implementation! The reason this is strange is that, for the theory, renaming is not a problem (a simple homomorphism of variable vectors, which could also account for different arities, would do), whereas for an implementation it could be a serious one.

In Section 4.2, query projection is suddenly suppressed, though it is very common in real-world queries. This deserves to be discussed, at least during the evaluation (reading Tables 1 and 2, it seems that it is supported after all?) and in relation to complexity, because projection is a major source of complexity.

The discussion at the beginning of page 2 is unclear: (a) 'we also consider such queries': do you mean 'we only consider such queries'? (b) 'we assume similar features': do you mean 'we assume similar restrictions'?

Being one of the publishers of one of the benchmarks, I am also embarrassed if you found mistakes in it. However, I would have appreciated being contacted about that (if you did and we did not answer, then the mistake is on us; that may happen). The main reason, for me, is that this could benefit everyone (agreeing to have the mistakes corrected, or taking any other useful action). The main reason, for you, is that it could only make your paper stronger to find a mistake and have it acknowledged by the benchmark designers before submitting. The same holds when adding to a benchmark: you may have a case, but if you are alone, it is weaker. Actually, it is clear that there is a mistake in rdfs21, but also in rdfs22, for instance, and this may have an impact on the results.

The notations are not very elegant, in particular the choice of monospace fonts (tt) for symbols which are not in SPARQL syntax (like IBLe or gp_1), or the use of the ':=' symbol in math. In Definition 3.5, the '$\mu \models R :=$ {curly brace}' is not needed (it duplicates the preceding sentence); it would just be useful to have a curly brace covering the first three lines of the case where R is $E_1=E_2$, and the 'if' in front of the 'not' should not be there.
- Definition 4.1 (arity): '$\alpha\in\mathcal{P}_s$ and' is useless (twice).
- The use of $\mathtt{t}$ in $\sigma_{\mathtt{t}}$ (Definition 4.2) is not very judicious, especially as $\mathtt{t}$ is used in Definition 4.4 as a variable.
- Definition 4.4: \bot does not seem to have been introduced in the logic; moreover, it seems to be a 0-arity predicate (that is a detail).
- It would be worth indicating the time-out. OK, now I see it at 20000ms, but it is not explicit: Table 2 only mentions #TO and the text mentions 'time'. It would also be good to justify why this value was retained.

(3) Main theorems

Clearly, the proofs of Theorems 4.1 are missing. A theorem is a theorem only when it has been proved. If the proof has been published, then it is fine to refer to it; otherwise, it should be given, or this is not a theorem. This is a first problem.

A second problem is that Theorem 4.1 apparently covers only what is in Section 4, i.e. none of the extensions that make the value of the contribution, and Theorem 4.1bis only what is covered in Sections 4-5? This may be a problem with the structure of the paper (see 1).

Finally, the paper fully lacks any discussion of complexity and decidability. There should, at least, be a discussion of the benefits and drawbacks of choosing first-order logic. Ideally, there should be a discussion of the complexity/decidability of the problem and of its encoding in FOL.

(4) Language

As an example, the article 'the' is consistently missing where it is needed:
'The number of datasets', 'verify the equivalence', 'the SPARQL language', 'the query containment problem', 'the semantic web', 'The grammar presented...', 'the select clause', 'the function x', 'the y operator', etc.
Other issues are given below, but this list is not exhaustive.
- 'a distributed environments'
- 'D that figures in' I do not think that 'figure' can be used in that sense (it is also unclear).
- 'the domains of $m$ and $\bar{x}$'
- 'In that case when' -> 'When' (twice)
- Definition 4.4: 'if the argument is a'
- 'standarad'
- subsumtion (Table 1)
- state-of-the-art with dashes when used as an adjective

Review #2
By Jorge Pérez submitted on 04/Apr/2021
Review Comment:

The paper presents a system to check containment of SPARQL queries. The strategy is the following: given queries q1 and q2, first encode the queries as First Order (FO) formulas f1 and f2, and then encode the containment as a validity check for the formula (f1 => f2). The validity is checked using an FO solver/prover. The authors show that their implementation covers more cases and is more efficient than most competing systems. An implementation is publicly available.
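To make this strategy concrete, here is a minimal sketch, not the paper's implementation (SpeCS relies on Z3 and Vampire to reason symbolically): it checks containment as validity of f1 => f2 by brute force over every interpretation of a single triple predicate on a two-element domain. The queries q1 and q2, the domain, and all names are illustrative assumptions.

```python
from itertools import chain, combinations, product

# Toy model: one property "p" over a two-element domain. A real solver
# reasons over all interpretations symbolically; brute force only works
# at this toy scale.
D = ["a", "b"]

def powerset(iterable):
    s = list(iterable)
    return chain.from_iterable(combinations(s, r) for r in range(len(s) + 1))

# Each query is modelled as: given a database (set of triples) and a value
# for the projected variable ?x, is the query body satisfiable?
def q1(db, x):  # SELECT ?x WHERE { ?x :p ?y . ?y :p ?z }
    return any((x, "p", y) in db and (y, "p", z) in db for y in D for z in D)

def q2(db, x):  # SELECT ?x WHERE { ?x :p ?y }
    return any((x, "p", y) in db for y in D)

def contained(qa, qb):
    """qa is contained in qb iff f_a => f_b holds in every interpretation."""
    triples = list(product(D, ["p"], D))
    return all(not qa(set(db), x) or qb(set(db), x)
               for db in powerset(triples) for x in D)

print(contained(q1, q2))  # True:  a two-step p-path implies a one-step one
print(contained(q2, q1))  # False: counterexample db = {(a, p, b)}
```

The point of the sketch is only the shape of the reduction: containment becomes a universally quantified implication, which a solver can discharge without enumeration.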

The containment problem for query languages is certainly relevant, especially for optimization purposes. Moreover, practical implementations for the case of SPARQL are welcome. Thus the motivation of the paper is clear. Unfortunately, the scientific contribution of this paper is rather limited, as I explain next.

That SPARQL queries can be transformed into FO formulas is a known result: citations [67,68] in the submission show that SPARQL can be transformed into non-recursive safe Datalog with negation, and thus into relational algebra and first-order logic. Thus, one cannot consider the encoding presented by the authors (Section 3 and the extensions in Section 5) a significant technical contribution. Encoding containment of FO formulas as a validity problem (Section 4.2) is a standard technique, and thus I would consider Theorem 4.1, which combines both previous observations, something of a folklore result and not a contribution of this paper. Up to Section 4 the paper essentially considers conjunctive queries. In Section 5.1, when the authors present a containment strategy for Unions of Conjunctive Queries, they use a standard result from the containment literature: first transform the queries into a normal form in which unions are the top operator, and then check containment considering the disjuncts separately (work by Halevy et al. on answering queries using views, and papers after that). After transforming everything into FO, adding FO axioms of any kind amounts to just adding additional formulas when checking validity (Section 7). Thus, given that RDFS axioms are simple enough to be expressed in FO, it is not surprising that RDFS axioms can be taken into account when checking containment.
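As an illustration of the standard results invoked in this paragraph (and not of SpeCS's FO encoding), the following sketch implements the classical canonical-database test for conjunctive-query containment, together with the reduction of union-of-CQ containment to per-disjunct checks; the query encoding and all names are my own assumptions.

```python
from itertools import product

# A conjunctive query: (head variables, body atoms); variables start with "?".
# Caveat of this toy encoding: a frozen "?x" would clash with a constant "x".
Q1 = (("?x",), [("?x", "p", "?y"), ("?y", "p", "?z")])
Q2 = (("?x",), [("?x", "p", "?y")])

def is_var(t):
    return t.startswith("?")

def contained(qa, qb):
    """Canonical-database test: qa is contained in qb iff evaluating qb over
    qa's frozen body yields qa's frozen head tuple."""
    head_a, body_a = qa
    head_b, body_b = qb
    freeze = lambda t: t.lstrip("?")  # each variable becomes a fresh constant
    canon_db = {tuple(freeze(t) for t in atom) for atom in body_a}
    consts = {c for atom in canon_db for c in atom}
    vars_b = list(dict.fromkeys(t for atom in body_b for t in atom if is_var(t)))
    target = tuple(freeze(v) for v in head_a)
    for values in product(consts, repeat=len(vars_b)):  # brute-force evaluation
        m = dict(zip(vars_b, values))
        if all(tuple(m.get(t, t) for t in atom) in canon_db for atom in body_b) \
           and tuple(m.get(v, v) for v in head_b) == target:
            return True
    return False

def ucq_contained(A, B):
    """Unions of CQs: union(A_i) is contained in union(B_j) iff every A_i
    is contained in some single B_j."""
    return all(any(contained(a, b) for b in B) for a in A)

print(contained(Q1, Q2), contained(Q2, Q1))  # True False
print(ucq_contained([Q1, Q2], [Q2]))         # True
```

This is the NP procedure for CQs that the cited containment literature builds on; the point of contrast with the paper is that SpeCS instead delegates the whole check to an FO solver.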

Regarding the practical part, the authors implemented a parser to go from SPARQL to FO. After this transformation they use standard solvers (Z3 and Vampire). I acknowledge that implementing a parser that works and is fast enough to show good results when combined with standard solvers is a difficult engineering task. But from a scientific point of view, there is not much that one can keep from it.

The main contribution of the paper is what is presented in Sections 8.4 and 8.6. In Section 8.4 the paper describes the benchmarks used and several problems encountered: some queries are erroneously marked in the benchmark as one containing the other. These findings would allow for better benchmarks in the future. In Section 8.6 the paper presents a comparison with current public systems, showing that the proposed system outperforms most of them on almost all sets of queries.

My impression is that the paper as it is cannot be accepted. I advise the authors to re-focus the contribution of the paper as a system description. You can spare all the content of Sections 3-7 by stating that you encode SPARQL into FO, citing the relevant literature, and perhaps explaining specific parts that have not been considered in detail before (for example, the addition of the GRAPH keyword, which I am not sure has been considered in the previous literature on transforming SPARQL to FO). Details of some translations can be provided in an appendix. You can then say that you exploit this relationship to implement query containment as an FO validity check, and focus on studying the performance of the system in comparison with other works.

Detailed comments and suggestions:

Related work:
The work by Salas and Hogan should be considered as it is very relevant:

- Jaime Salas, Aidan Hogan. Canonicalisation of Monotone SPARQL Queries. International Semantic Web Conference (1) 2018: 600-616
- Jaime Salas, Aidan Hogan. QCan: Normalising Congruent SPARQL Queries. International Semantic Web Conference (P&D/Industry/BlueSky) 2018

In these works the authors present a system that brings SPARQL queries into a canonical form, which can thus be used to check equivalence of queries syntactically. The proposed approach covers a huge portion of the SPARQL language, and the authors show that their system outperforms several previous proposals. Although the focus is not on containment, one can argue that containment is mostly used to check equivalence when performing optimization, and thus I think it would be very relevant to present a comparison with this work.

Technical comments:
- In the paragraph after Definition 4.8 you say that checking the containment "Q1 contained_in Q2" is undecidable if Q2 contains projection, citing [17]. This comment is misleading, since up to that point you were only considering conjunctive queries, and containment of conjunctive queries is decidable and in NP even if projection is used (Halevy et al., Answering Queries Using Views). In [17] the undecidability is proved for the case in which queries contain the OPTIONAL operator, which complicates all the analysis. Thus one cannot use [17] as a formal argument for not studying containment when Q2 contains projection, at least not up to that point in the paper.

- The transformation that puts all UNION operators at the top is very costly, in fact exponential, and you should discuss this. Also, when considering only conjunction and UNION, the transformation is simple, but when the OPTIONAL operator is used, it is far from trivial (see Proposition 3.8, Claim 3.9, and the discussion after them in [19]).
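The exponential cost mentioned in this point is easy to see: distributing AND over n binary UNIONs yields 2^n disjuncts. A small sketch, with all names assumed for illustration:

```python
from itertools import product

def to_union_normal_form(conjuncts):
    """Distribute AND over UNION: one disjunct per way of picking a branch
    from each conjunct, i.e. 2^n disjuncts for n binary unions."""
    return [choice for choice in product(*conjuncts)]

# Query shape: (A1 UNION B1) AND (A2 UNION B2) AND ... AND (An UNION Bn),
# each pattern represented here by an opaque label.
n = 10
conjuncts = [(f"A{i}", f"B{i}") for i in range(n)]
print(len(to_union_normal_form(conjuncts)))  # 1024 = 2**10
```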

- I think there are several occurrences of the word "the" missing. For example:
* page 2, line 24, right column: "on two most important" --> "on two of the most important"
* page 3, line 7, left column: "With increasing popularity of Semantic Web" --> "With the increasing popularity of the Semantic Web"
* page 3, line 11, right column: "is the same as expressiveness of relational algebra" --> "is the same as the expressiveness of the relational algebra"
* page 7, line 19, right column: "Grammar presented in Figure 1" --> "The grammar presented in Figure 1"
There are several other places in the paper with similar issues (I think that a proofread by a native speaker would suffice to solve most of these small problems).

- There are two notations that use "[[" and "]]", but they have different fonts (Definitions 3.4 and 3.7, page 6). This is a bit confusing, as the reader may think that the font difference is meaningful. I advise using the same font (ideally that of Definition 3.7).

- The paper presents "Definition 3.7" in Section 3 and then, in Section 5, a "Definition 3.7(u)". This numbering for definitions is a bit confusing. Also, using ellipses (...) inside the definitions is strange. I would recommend having a single big definition with every possible construct, or maybe separating it into two, one in Section 3 and another in Section 5. Even simpler would be to put each of them in a table that can be referenced from the text.