Review Comment:
### Summary:
The submission introduces and studies a way to make statements about the completeness and soundness of SPARQL queries over RDF data. Since RDF adopts the open world assumption (OWA), the submission introduces so-called "completeness statements". These statements make it possible to express that some parts of an RDF graph actually contain complete information, i.e. that the information provided by the graph is complete for certain topics. This in turn enables reasoning over queries in order to determine whether the answers provided by a query over such an RDF graph are complete or incomplete. Perhaps of even greater importance is the notion of soundness in the RDF/SPARQL setting considered in the submission: while RDF follows the OWA, the semantics of SPARQL in the presence of negation implements the closed world assumption (CWA). Completeness and soundness make it possible to close this gap: if the RDF graph is complete for the negative parts of the query, the result of the query is guaranteed to be sound, i.e. to remain true even over possible extensions of the graph.
The main contributions of the submission are as follows:
- A formalization of these completeness statements for RDF graphs
- A characterization of when a SPARQL basic graph pattern returns complete answers
- A formalization of soundness (distinguishing two kinds of query soundness: answer and pattern soundness; answer soundness asks whether a specific answer is sound, pattern soundness asks whether all answers to a specific pattern are always sound)
- An algorithm for checking completeness as well as a study of the complexity of the problem
- An algorithm and complexity analysis for checking soundness
- The submission then discusses some possible optimizations of these algorithms, and reports on an implementation.
- For the evaluation of the implementation, test data, queries, and completeness statements were created from Wikidata.
### General Comments:
Deciding whether the answer to a query is complete is a very interesting problem. Besides being important information in its own right, using completeness to close the gap between query languages (especially with negation) that assume the CWA and data sources that operate under the OWA seems to be a very promising way to deal with these kinds of problems. The approach chosen in the submission to implement completeness statements appears to be sufficiently simple yet powerful. It also seems to be more generally applicable than just to SPARQL and RDF. The submission provides a good study of the properties of the introduced notions (providing useful characterizations and a complexity analysis that seems to cover all major cases). I could not spot any problem with the provided results. One possible criticism is that the empirical evaluation is basically done in a single setting only (of course covering different query sizes). While the results seem promising, the evaluation is not overly extensive. The article is well written and pleasant to read, with a large number of examples that illustrate the concepts in sufficient detail. However, I have a few minor concerns regarding the presentation, as explained in more detail below (mainly the use of bag and set semantics, but also some additional minor comments).
Overall, the submission covers an interesting topic, has a good number of results that are correct, and for the most part is well written. I therefore recommend accepting the submission, possibly with a minor revision (I strongly believe that the authors can fix these problems, and I don't think any single one of them requires a minor revision; however, in total the number of changes may justify one. I leave this decision to the editors).
### Detailed Comments:
# Set and Bag Semantics
My only real concern is the somewhat sloppy and inconsistent use of, and distinction between, set semantics and bag semantics.
First of all, I don't quite see the benefit of studying bag semantics when throughout the article it seems to actually play a role only once. This is in Example 10, where it is shown that in the presence of projection (the only operation studied in the submission that is actually capable of introducing duplicates) Theorem 3 does not hold. All results on completeness are presented for BGPs only, where the solution to a BGP is actually a set, since duplicates cannot occur, given that RDF graphs are sets. To me, it seems as if the use of bag semantics makes things unnecessarily complicated without any major benefit. Wouldn't it be easier to work with set semantics throughout the article, and at the end have some (sub)section discussing the effect of bag semantics? This way, some of the problems stated below could be avoided.
Second, the notation \subseteq for set/bag containment is either not used consistently or in a confusing way. Without any further definition, I would assume that \subseteq expresses set containment, thus ignoring multiplicities. It is actually used in accordance with this assumption at the beginning of the article (e.g. in Section 2, where it is used between RDF graphs, i.e. sets of triples, or to express that some set W \subseteq var(P) [line 197]), i.e., between sets. Then, in Definition 4, it is used between bags (in "[[Q]]_G' \subseteq [[Q]]_G"), so it does not become clear at this point whether the notion of containment ignores or includes multiplicities. I think this is actually one of only two places in the whole article where there is an actual comparison between bags (which gets back to my first comment, on whether dealing with bag semantics throughout the whole article is actually worth it). The second place where bags are compared is line 913. Here, suddenly \subseteq_s is introduced for "set inclusion"; however, I think it is used only once. And already in and after Proposition 10, plain \subseteq is used for containment between sets. Now all of this is not necessarily wrong or actually inconsistent: assuming \subseteq actually means bag inclusion, it can of course also be used between sets, where it coincides with common set inclusion, and at the only position where set inclusion between bags is needed, \subseteq_s is used. However, if this is the intention, it is extremely confusing and should be clarified and made explicit in the Preliminaries (since this would also clarify what kind of inclusion is intended in Definition 4).
Third, I must admit that I do not understand the argument starting on line 432 on why it is sufficient to check only the BGPs for completeness. On the one hand, I do not quite understand what the "therefore" in "we can therefore focus on the BGPs" refers to (it seems to refer to the fact that bag semantics is assumed, and that does not make any sense to me). On the other hand, I do not quite understand how projection can simply be ignored in this case. Let's consider the RDF graphs
G = {(UN member Germany), (Germany lang DE)} and
G' = G \cup {(Germany lang LANG)} [let's assume just any language for the sake of an example]
Consider the completeness statement
Compl((UN member ?m))
and the query
Q=({?m}, {(UN member ?m), (?m lang ?l)})
Now, is Q complete over G or not, given that [[Q]]_G = {(?m -> Germany)} and [[Q]]_G' = {(?m -> Germany), (?m -> Germany)}? I don't have an answer to this: neither whether it should be considered complete (I can find arguments for both decisions), nor whether it is complete according to Definition 4.
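For concreteness, the two evaluations above can be reproduced with a small bag-semantics BGP evaluator (a minimal sketch of my own, not code from the submission; all function names are mine):

```python
from collections import Counter

# Toy bag-semantics BGP evaluator. Triples and triple patterns are
# 3-tuples; terms starting with "?" are variables.

def match(tp, triple, binding):
    """Extend binding so that triple pattern tp matches triple, or return None."""
    b = dict(binding)
    for pos, term in zip(tp, triple):
        if pos.startswith("?"):
            if b.get(pos, term) != term:  # conflicting earlier binding
                return None
            b[pos] = term
        elif pos != term:                 # constant mismatch
            return None
    return b

def eval_bgp(bgp, graph):
    """All solution mappings of the BGP over the graph, as a list (a bag)."""
    solutions = [{}]
    for tp in bgp:
        solutions = [b2 for b in solutions for t in graph
                     if (b2 := match(tp, t, b)) is not None]
    return solutions

def project(solutions, variables):
    """Bag-semantics projection: duplicates are kept and counted."""
    return Counter(tuple(b[v] for v in variables) for b in solutions)

G  = {("UN", "member", "Germany"), ("Germany", "lang", "DE")}
G2 = G | {("Germany", "lang", "LANG")}
Q  = [("UN", "member", "?m"), ("?m", "lang", "?l")]

print(project(eval_bgp(Q, G),  ["?m"]))   # Counter({('Germany',): 1})
print(project(eval_bgp(Q, G2), ["?m"]))   # Counter({('Germany',): 2})
```

The two results are identical as sets but differ in multiplicity, which is exactly why it matters which kind of inclusion Definition 4 intends.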
Of course, the whole problem with set or bag semantics can be easily fixed, and is by far not as severe as the long comment on it may suggest.
# Choice of the Query Language
Although the submission claims to study the query language SPARQL, the language defined in the Preliminaries and studied throughout the submission is just some conjunctive fragment of SPARQL extended by some form of negation. However, it is far from being full SPARQL (neither 1.0 nor 1.1), as also acknowledged by the authors in the discussion. First of all, I find it somewhat strange to call this language SPARQL (especially in the Preliminaries, which are written in a way that suggests that the language of SPARQL is defined there). Second, one could argue that, by not covering full SPARQL, the title of the submission promises a little too much.
This does not mean that having such a generic conjunctive query language with negation is necessarily bad, but the presentation might need to be adapted a little.
# Specifics of RDF and SPARQL
Given this rather simple query language and the different definitions of completeness and soundness, the question arises to what extent these definitions are specific to RDF and SPARQL, or apply to similar languages and settings as well. This works both ways: on the one hand, while it is very well motivated why having e.g. a notion of soundness is a very important tool to bridge the gap between the OWA of RDF and the CWA of SPARQL, it does not become clear what the specifics of RDF and SPARQL are that prevent simply applying existing results/notions of completeness, mainly from relational databases (the related work section provides good details on how earlier work in the area of the Semantic Web differs from the current submission, but for relational data, this distinction did not become clear to me). Thus, it did not become quite clear to me what the obstacles and challenges are, i.e. what has to be taken care of when defining the notions presented in the submission.
On the other hand, this means that these notions and results seem to be more broadly applicable than just to RDF and SPARQL. Or are there any restrictions? Would the results still hold for arbitrary atoms instead of triple patterns? (Since most of the arguments are based on some mappings/homomorphisms, it at least seems to be the case.) Maybe these opportunities could be discussed in a little more detail.
Overall, to me the submission makes more the impression to provide a general framework for completeness and soundness reasoning than to be specific for RDF and SPARQL.
# Proposition 2: "and mapping \mu" (at the very end of Prop. 2): I don't think this is correct; it should rather be "and solution \mu \in [[P]]_G"?
# Lemma 1:
The statement \tilde{P} \subseteq G is rather confusing (and unnecessarily so, I think). The problem is that \tilde{P} \subseteq G can only hold if P is ground, since otherwise \tilde{P} by definition contains at least one element not contained in G (this fact is even pointed out in the proof of Lemma 1). Thus, \tilde{P} \subseteq G if and only if P \subseteq G. Stating the Lemma in terms of P instead of \tilde{P} would, as far as I checked, still work, and would greatly improve both the readability of some proofs and how intuitive the Lemma is (stating it in terms of \tilde{P} just makes it more complex). Also, P \subseteq G would actually be in line with the statement in line 697, which says exactly that the saturated BGP (i.e. P in Lemma 1) has to be in the graph.
The same argument then also applies to Theorem 1 and the notion \tilde{\mu P} \subseteq G. Again, the use of the freezing function makes the whole statement considerably more complex to understand; in the end, containment can only hold if P is ground. Also in this case I think the proof can be adapted accordingly (but I might be missing something).
# NRF graph patterns and the complexity of Pattern Soundness:
What is the actual relationship between a graph pattern P, a corresponding NRF graph pattern P' (i.e., P' = NRF(P), assuming NRF(P) to be a function returning an NRF version of P), and soundness? I think it is the case that P is sound iff P' is sound? Maybe this relationship could be made explicit; this might help to clarify the situation regarding the complexity of pattern soundness. In the first paragraph on page 14 it is stated that a polynomial number of completeness checks suffices to compute NRF(P). However, actually computing NRF(P) is a function problem, and it looks to be P^NP (while NP is sufficient to check whether some triple pattern can be removed, NP is not powerful enough to decide whether actually all redundant triple patterns have been removed, i.e. that no more triple patterns shall be removed). Maybe the authors could elaborate in a little more detail on how the NP membership can be established.
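To make the function-problem flavour of this computation explicit, computing NRF(P) can be sketched as a fixpoint loop around the removal check (a hypothetical sketch of my own; `is_redundant` merely stands in for the NP-complete check from the submission, and all names are mine):

```python
def nrf(bgp, is_redundant):
    """Greedily remove redundant triple patterns until none is left.

    is_redundant(tp, rest) is a stand-in oracle for the NP check from
    the submission. Each pass over the pattern issues up to |bgp|
    oracle calls and removes at most one pattern, so the whole loop
    uses polynomially many calls -- but deciding that the loop is DONE
    (no pattern is removable) is itself a round of oracle calls, which
    is why the overall computation looks P^NP-like rather than plain NP.
    """
    bgp = list(bgp)
    changed = True
    while changed:
        changed = False
        for tp in bgp:
            rest = [t for t in bgp if t != tp]
            if is_redundant(tp, rest):
                bgp = rest       # drop tp and restart the scan
                changed = True
                break
    return bgp

# Toy oracle: pretend exactly the patterns with predicate "redundant"
# can be dropped (for illustration only).
toy = [("?x", "p", "a"), ("?x", "redundant", "b"), ("?y", "q", "c")]
print(nrf(toy, lambda tp, rest: tp[1] == "redundant"))
# [('?x', 'p', 'a'), ('?y', 'q', 'c')]
```

If soundness of P can instead be decided via a single guessed certificate (without materializing NRF(P)), the NP bound would follow, which is exactly the detail I would ask the authors to spell out.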
# Evaluation of the implementation:
The results actually seem promising, and based on the chosen sources (for both data and queries), the query sizes seem to be correct/justified. However, an average query size of 2.6 triple patterns nevertheless feels rather small.
### Typos/Minor Questions:
line 11: I would suggest using "Knowledge Base" instead of "KB" (especially since I think this is the only occurrence in the next 20 pages). But since I do not know whether "KB" is commonly used without introduction, I would ask the authors to just take this as a hint and decide based on their experience.
line 150 (p.3): "RDF and SPARQL, and completeness statements": I would suggest to either replace the first "and" by ",", or to replace the second "and" by "as well as".
line 183: (being a little bit picky here) saying that a BGP "consists" of a set of triple patterns is a little bit imprecise, since it's not completely clear what "consists" means. Why not just say "a BGP is a set of triple patterns"?
Preliminaries (Subsection 2.1): While the Preliminaries section introduce all major notions used throughout the article, there are a few minor problems:
- In this section, the notion of an RDF graph is introduced. However, starting with Section 2.2, this term is almost never used; instead, just "graph" is used (an exception is e.g. line 301, where "RDF graph" is used). It would be good if this were announced somewhere at the beginning of the Preliminaries (something along the lines of "in the following, we will just write graph instead of RDF graph"). Also, beginning with Section 6, the term "data graph" is used, making it three different terms for the same thing.
- The notions dom() and var() (first used in line 189) are not introduced. While their meaning is of course obvious, they are usually defined nevertheless.
- In lines 197/198, the notion of a graph pattern has not yet been defined (the definition only appears in line 213).
- line 198: "[[Q]]_G is obtained": "the evaluation/solution/result [[Q]]_G is obtained". Also, G is not really defined at this place.
- As already mentioned earlier, the wording/presentation leaves the impression that the query language defined here is actually SPARQL - which is not the case.
line 226: "also negated version" -> "also the negated version"
Definition 2 (Completeness Statements): Is there a reason to define the completeness statements as Compl(P) for a nonempty BGP P, instead of just being P? Of course, in the end it does not make a big difference, but there are several occasions throughout the submission (e.g. several proofs), where having the "wrapper" Compl() just makes notation more difficult, since it is necessary to talk about "the BGP of the completeness statement", instead of just "the completeness statement". Is there any benefit in making this distinction?
line 307/308: "one mapping result": either "one mapping" or "one result" (mapping is probably more accurate)
page 5, second column, image at the top: It takes some time to understand what is actually shown in this image, namely that the new boxes indicate which completeness statements guarantee which part of the graph to be complete. One problem is that the description of the graph comes a few lines after the graph itself, and then has no proper reference to the graph. Maybe make the graph a figure? This would a) make it possible to provide the most basic information in the caption, and b) allow referencing the graph at the correct position in the text.
line 347: "the (usa, lang, ?l)" -> "the triple pattern (usa, lang, ?l)"
line 364: "no graph can contain more information about Q_3.": This wording is a little misleading: Q_3 is ground and contained in G_cou. So of course no other graph can contain more information on Q_3, since a ground query either maps into the RDF graph or it does not. Once it maps into G_cou, what more information could a graph contain about this query?
line 390: "that the members are completely different items": what does this mean? (e.g., different from which items/elements?)
line 508: at the word "statements": there seems to be a rogue "}" between "e" and "m"
line 527: My understanding is that \tilde{c} represents an arbitrary value. However, the notation \tilde{} was introduced and used for the frozen values of variables. While frozen values are a convenient way to argue about arbitrary values, there is a difference between a frozen value and an arbitrary value. Thus I suggest using a different notation for the arbitrary IRI here.
Section 3.3: Why not use a formal definition block/environment for the definition of query soundness, as was done for completeness? This would be consistent with the previous section and would probably make the definition easier to read.
line 623 (approx.): "while the rest to the graph": this sentence should be checked. Maybe something like "while the remaining parts are mapped to the graph"?
line 660: Reading the paragraph starting at this line for the first time was rather confusing, since it is not quite clear what "repeated application", and thus termination, really means in this situation. As a result, the reason why the 2nd case should lead to termination (and not just return the same result again and again) does not become clear either. It makes more sense once one has read the formal parts on the next pages, but this paragraph does not really help in preparing for them. Maybe this could be clarified a little.
Table 1 (page 12): Why is "complete" abbreviated? There would be enough space for the full word.
line 879: "in this subsection": replace by "in this article", since the class of graph patterns was defined earlier and not in this subsection.
line 757: "In our algorithm": Strictly speaking, this algorithm is never described. There is only Theorem 1, from which a corresponding algorithm can easily be derived. However, since this is not done in the submission, maybe the wording here could be adapted accordingly.
Proposition 7 (but in general for the complexity results): It is slightly more common not to include the fixed parts of the input in the list of given items, but to state them beforehand. E.g., "Let P be a BGP. Given a set C of completeness statements and an RDF graph G, deciding ... is NP-hard. The complexity remains the same even when G is also fixed." Of course, this is a matter of taste, but it makes it easier to grasp which parts are actually fixed.
line 880: "be monotonic": "is monotonic"
line 965: Is it necessary to introduce \sqsubseteq_s for query containment under set semantics, or could \subseteq_s be used as well (usually, under set semantics, \subseteq is used for query containment)? But this of course is a matter of taste.
line 1001: Does this hold for arbitrary graph patterns or only NRF patterns?
line 1016: "or where the projection is over all variables in the positive part of the query body": what is the benefit of allowing projection on variables from the negative part? By definition, they are never mapped to any value, so they are not in the domain of any solution. Of course, because of exactly this argument, it does not hurt to allow them in the projection. However, mentioning this property there, even more so without explanation, is quite a distraction from the actual point being made.
second line of Example 10: "it is impossible to right-shift any triple": personally, I think "there is some triple that cannot be right-shifted" is the much clearer and easier to understand formulation.
line 1050: "One can show that P is complete if and only if the positive part P+ is complete": Is this hard to show, or as obvious as it seems at first glance? (In the latter case, maybe the statement could be adapted accordingly; otherwise, a hint on why it is true would be helpful.) Why not make this a formal result?
line 1058/1059: Teasing the reader that having a shared variable is important, but not saying anything more about this is ... a little bit cruel?
Equation (2): Strictly speaking, C_T has not been defined. Only C_\tau for \tau \in T was defined previously. However, one can of course argue that the definition via \bigcup_{\tau \in T} C_\tau is immediate.
page 17ff: Already earlier, starting with Section 6, there seems to be a slight change in the presentation style of the paper. Especially on page 17, it is no longer the "formal result, then proof" style; instead, some (informal) arguments are provided first, and then the result is stated. In the case of the first equation on page 17 and its equivalence with [[P_\tau]]_{...}, this even happens without providing any formal result at all. Proposition 11 is stated as a formal result, but instead of a proof, the arguments are made right before the Proposition. Maybe these results could be presented more in line with the previous ones.
page 17, 2nd column: Why is the replacement of variables restricted to non-predicate values only (twice on this page)?
page 18, 2nd column: Is the way of creating completeness statements described here actually helpful in approximating the size of real-world completeness statements? The authors report on several completeness statements in natural language for different data sources. How big are the formal completeness statements expressing them? Also, in the test setting, as a result of how they were created, the size of the completeness statements is very similar to the size of the queries. Is this a realistic scenario (in the examples throughout the article, queries rather had a size two or three times that of the completeness statements)? Maybe the authors can comment on this.
Table 2: While the runtime for the first 6 lines is somewhat stable, it is significantly higher in the last line. Do the authors have any indication of whether this is an outlier or the start of a significant increase in runtime with query size?
The references (page 25ff) are inconsistent, especially for conference publications. Some are very minimal (e.g. [1]), others very detailed (e.g. [9]).
## Appendix:
Why does each proof get its own appendix? This makes the structure rather unclear and results harder to find. Why not something like one appendix per section?
line 2146 (page 28): I think [[(P,\emptyset)]] should be [[(P,\mu_{\emptyset})]] ?
line 2186: "It is trivial to see that P is ground": "that then P is ground"
line 2189: Maybe the inclusion [[P]]_G' \subseteq [[P]]_G could be clarified by stating in addition that [[P]]_G' = [[P]]_G = {\mu_\emptyset}.
line 2205: I think (P,\mu_{\emptyset}) should be {(P,\mu_{\emptyset})} ?
line 2207: incomparable to what/wrt. what?
line 2211: By the first claim: By the second claim
line 2215: the second claim: the first claim
page 29, last line of second column: var(p_i): The notation var() is used throughout the article to denote the set of variables in "something", for example a BGP. While I agree that the overloading of this term here is obvious and does not really lead to confusion, it might still be advantageous not to overload the operator.
page 30, second line of first column: "?\neg p_i": While of course this can be used as a variable name, it looks very strange, and I have never encountered this notation before. Usually, \bar{p}_i or similar notations are used, since this also helps to distinguish between the propositional variables p_i and the variables in the query. Of course, it is not wrong, but it looks very unfamiliar.
lines 2257/2258: It would really improve the readability of the proof if, instead of repeating the definition of cruc_{C,G}(P) here, its actual value were provided.
line 2324: Why is there such an emphasis on "directed" graphs? Is there any difference between the 3-colorability problem for directed and undirected graphs that I am not aware of? If not, I think it would be more natural and less confusing to use the usual 3-colorability problem, which, to my understanding, is commonly defined over undirected graphs (also, I don't see where in the proofs the direction of the edges is of importance). In case the directed version does matter, maybe the authors could discuss this briefly (since, as mentioned, I did not get it).
line 2325: (rather picky) If no citation is given for the complexity of QSAT, I would say no citation for 3-colorability is needed either (especially if the undirected version is used).
lines 2345ff: the proof proceeds a little too quickly here. Maybe some additional details could be provided. For example, the BGP P_{col} is ground; thus, how does the frozen version of P_{col} differ from P_{col} itself?
line 2388 (or 2387): (again a little picky) The 'b' in "(a,b,c)" is probably not intended to be the same 'b' as the one for blue in e.g. "(r, edge, b)". It is not a real problem since they never interfere, but it would be nicer to use different names.
around line 2395: It is "3-uncolorable" in the statement of the claim, and "3-incolorable" at the beginning of the proof of the claim (inconsistent).
line 2400: The fact that the cruc operator returns exactly triples(G_p) is actually independent of the assumption that G_p is not 3-colorable. However, being the first statement after the assumption, it left me a little puzzled, since I tried to make a connection to the assumption. The same argument is also used in the other direction of the proof. Maybe it would be more intuitive for the reader if this fact were stated before "Assume G_p is not ...", i.e., right after the first sentence, as something like "First, observe ...".