Completeness and Soundness Guarantees for Conjunctive SPARQL Queries over RDF Data Sources with Completeness Statements

Tracking #: 1966-3179

Fariz Darari
Werner Nutt
Simon Razniewski
Sebastian Rudolph

Responsible editor: 
Aidan Hogan

Submission type: 
Full Paper
Abstract:

RDF generally follows the open-world assumption: information is incomplete by default. Consequently, SPARQL queries cannot be guaranteed to return complete answers, and, even worse, when they involve negation, it is unclear whether they produce sound answers. Nevertheless, there is hope to lift this limitation. On many specific topics (e.g., children of Trump, Apollo 11 crew, EU founders), RDF data sources contain complete information, a fact that can be made explicit through completeness statements. In this work, we leverage completeness statements over RDF data sources to provide guarantees of completeness and soundness for conjunctive SPARQL queries. We develop a technique to check whether query completeness can be guaranteed by also taking into account the specifics of the queried graph, and we analyze the complexity of this check. For queries with negation, we approach the problem of query soundness checking and distinguish between answer soundness (i.e., is an answer of a query sound?) and pattern soundness (i.e., is a query as a whole sound?). We provide a formalization and characterize the soundness problem via a reduction to the completeness problem. We further develop heuristic techniques for completeness checking and conduct experimental evaluations based on Wikidata, a prominent, real-world knowledge base, to demonstrate the feasibility of our approach.


Solicited Reviews:
Review #1
By Sebastian Skritek submitted on 05/Sep/2018
Review Comment:

In my opinion, the submission has improved significantly:
the SPARQL fragment under consideration is now clear, the problem with bag vs. set semantics has been resolved, the relationship to and extension of previous work is now much clearer, and also the goals and results in Sections 6 and 7 are much easier to grasp now.

As also mentioned by the other reviewers, some of the definitions are very technical and have quite a "build up" (i.e., are based on a chain of definitions), but this seems to be inherent to the problem. Also, the ideas and purpose of these definitions are now well motivated and described in a way that makes it easier for the reader to follow them. There is a single exception (the definition of the "crucial part") that will be described below in more detail.

Overall, as mentioned, the submission benefited greatly from the revision (also, I could hardly spot any mistakes) and I suggest to accept it.


The only comment I would like to highlight, as mentioned above, is the presentation of the definition of the "crucial part (cruc)". For this notion, no real explanation is given for why it is defined the way it is. This makes the notion really hard to grasp and understand - at least for me. This is especially true for why the definition includes the "\cup G": the pattern P and T_C(P) intuitively made sense to me, but coming up with a motivation/explanation for why G is included as well was much harder. The problem here is that no real explanation for this decision is given.
Although lines 690-700 were added with additional explanations, these lines basically repeat the formal definition of cruc in text form (i.e., the "what"), but do not provide any additional information on the "why". (Maybe I missed something and it is completely obvious, but to me it was not.) However, the problem was not reading the formal definition, but making sense of it.
A similar comment also applies to Example 5: while it is very helpful in demonstrating how cruc works, and thus in getting familiar with it and internalizing it, the example again only explains the "what" and not the "why" (why is it important to add G_{cou}, what is the reasoning behind this).
Finally, also the three lines before equation (1) do not help in this respect, because it is again not stated why the remaining parts are matched to the graph.
Thus, if the explanations and the example could focus a little bit more on this (i.e., the intuition/motivation), it would massively help to understand this central concept.

======== Minor Comments and Typos

page 3, top of 2nd column: There is a slight inconsistency between line 135 (stating that the work considers the conjunctive fragment) and line 140 (conjunctive fragment with negation). These two fragments are often considered to be distinct - although there is no problem to assume the "conjunctive fragment" to include negation, but then the notion should be used consistently (otherwise, once both notions appear one assumes that just the "conjunctive fragment" does not include negation - because why would it be emphasized in the other cases).
Also, "conjunctive fragment" is in line with the title, while "conjunctive fragment with negation" is not. This occurs again in line 271 (however, there is indeed a good reason not to blow up the title even more).

line 260: "with negation" -> "without negation" (?)

lines 523ff: Strictly speaking, C_{spa} and C_{eu} do not seem to be defined/introduced. This could, e.g., be solved by a simple "defined analogously to the completeness statements G_{ger} and G_{usa}". This would also resolve the problem that it is not completely clear what they say about completeness.

line 685+6: "instantiated completely": if this is a well-established term then it's of course fine, but at least to me this notion is a little bit misleading, as if it would describe a single mapping and the fact that this mapping instantiates all variables within a certain subset of all variables occurring in the BGP (rather than talking about the set of all instantiations). So maybe this could be reformulated to be less ambiguous.

line 779: For the first and third case it is nicely described what these cases mean (e.g., "corresponds to the non-existence of the query answer in any possible extension"), and the descriptions provide a good intuition on why these cases lead to these results. For the second case, this is never said explicitly. Although it can somehow be concluded from the definition of saturation, it is not laid out as nicely as for the first and third case (the one-sentence summary for the second case only states the more than obvious fact that it generates again the input mapping).

lines 902-905: Although likely debatable, in this situation I personally don't think the fact that the reduction uses a fixed graph is a good "reason" for the hardness. This is for two reasons. First, the problem would still be hard even if the previous proof did not use a fixed graph. Second, having a fixed graph in the proof does not explain the hardness (i.e., is not a reason), but is rather a prerequisite. Personally, in such situations I would prefer as a "reason" some property of the problem (not of one specific proof) that provides some intuition on why the problem remains hard - which could even be an explanation/intuition on why having a fixed graph is actually sufficient for the reduction.

line 1167: 'is similar to' -> 'is negation similar to'

line 1182: 'is similar to' -> 'is negation similar to'

line 2358: 'can generalized' -> 'can be generalized'

line 2366: 'hat not yet considered' -> 'has not yet been considered'

lines 3040 and 3043: '\subseteq_s' -> just '\subseteq' (?)

line 3067: again "\subseteq_s"

line 3089 (numbering is a little bit off there: 4 lines below 3085, not 1 line before 3090): "ue to" -> "due to"?

Review #2
By Martín Ugarte submitted on 26/Nov/2018
Review Comment:

Because my session expired I lost my entire review when clicking the "submit review" button. This review took a good amount of work and I simply don't have the time to write it again. I'm very sorry to the authors for this, it is extremely frustrating.

Here is a summary:
The authors did an excellent job addressing all the comments in the original review, in particular regarding set/bag semantics, the fragment under consideration, and the readability of Sections 3-5. The results in Sections 6 and 7 are not too strong, but stated as heuristics and with a clear experimental objective they are a good contribution to the paper. The modifications to Sections 1, 8 and 9 are a great addition in that they help to understand exactly where this paper stands and what its contributions are. The paper is now easy to follow, well written and pleasant to read; I only found a few typos. Overall, I think it has vastly improved since the first version and it should be accepted.