Ensuring the Completeness and Soundness of SPARQL Queries Using Completeness Statements about RDF Data Sources

Tracking #: 1765-2977

Authors: 
Fariz Darari
Werner Nutt
Simon Razniewski
Sebastian Rudolph

Responsible editor: 
Aidan Hogan

Submission type: 
Full Paper
Abstract: 
RDF generally follows the open-world assumption: information is incomplete by default. Consequently, SPARQL queries cannot be guaranteed to retrieve complete answers, and even worse, when they involve negation, it is unclear whether they produce sound answers. Nevertheless, there is hope to lift this limitation. On many specific topics (e.g., children of Trump, Apollo 11 crew, EU founders), RDF data sources contain complete information, a fact that can be made explicit through completeness statements. In this work, we leverage completeness statements to bridge the gap between RDF and SPARQL. We first develop a technique to check query completeness based on RDF data with completeness information. For queries with negation, we address the problem of query soundness checking. We provide a formalization and characterize the soundness problem via a reduction to the completeness problem. We further develop efficient methods for completeness checking, and conduct experimental evaluations based on Wikidata to demonstrate the feasibility of our approach.
Tags: 
Reviewed

Decision/Status: 
Major Revision

Solicited Reviews:
Review #1
By Sebastian Skritek submitted on 30/Jan/2018
Suggestion:
Minor Revision
Review Comment:

### Summary:

The submission introduces and studies a way to make statements on the completeness and soundness of SPARQL queries over RDF data. Given that RDF follows the open-world assumption, the submission introduces so-called "completeness statements". These statements make it possible to express that some parts of the RDF graph actually contain complete information, i.e. that the information provided by some RDF graph is complete for certain topics. This then enables reasoning over queries in order to determine whether the answers provided by a query over such an RDF graph are complete or incomplete. Perhaps of even greater importance is the notion of soundness in the RDF/SPARQL setting considered in the submission: While RDF follows the OWA, the semantics of SPARQL in the presence of negation implements the closed world assumption. Completeness and soundness make it possible to close this gap: if the RDF graph is complete for the negative parts of the query, the result of the query is guaranteed to be sound, i.e., to remain true even over possible extensions of this graph.

The main contributions of the submission are as follows:
- A formalization of these completeness statements for RDF graphs
- A characterization of when a SPARQL basic graph pattern returns complete answers
- A formalization of soundness (distinguishing two kinds of query soundness: answer and pattern soundness; answer soundness asks whether a specific answer is sound, while pattern soundness asks whether all answers to a specific pattern are always sound)
- An algorithm for checking completeness as well as a study of the complexity of the problem
- An algorithm and complexity analysis for checking soundness
- The submission then discusses some possible optimizations of these algorithms, and reports on an implementation.
- For the evaluation of the implementation, test data, queries, and completeness statements were derived from Wikidata.

### General Comments:

Deciding whether the answer to a query is complete is a very interesting problem. Besides being important information on its own, using completeness to close the gap between query languages (especially with negation) that assume the CWA and data sources that work under the OWA seems to be a very promising approach to deal with these kinds of problems. The approach chosen in the submission to implement completeness statements appears to be sufficiently simple yet powerful. Also, it seems to be more generally applicable than just for SPARQL and RDF. The submission provides a good study of the properties of the introduced notions (providing useful characterizations and a complexity analysis that seems to cover all major cases). I could not spot any problem with the provided results. One possible criticism is that the empirical evaluation is basically done in a single setting only (of course covering different query sizes). While the results seem promising, the evaluation is not overly extensive. The article is well written and easy to read, with a large number of examples that illustrate the concepts in sufficient detail. However, I have a few minor concerns regarding the presentation, as explained in more detail below (mainly the use of bag and set semantics, but also some additional minor comments).
Overall, the submission covers an interesting topic, has a good number of results that are correct, and for the most part is well written. I therefore recommend accepting the submission, maybe with a minor revision (I strongly believe that the authors can fix these problems, and I don't think any single one of these problems requires a minor revision; however, in total the number of changes may justify a minor revision. I leave this decision to the editors).

### Detailed Comments:

# Set and Bag Semantics

My only real concern is the somewhat sloppy and inconsistent use of - and distinction between - set semantics and bag semantics.

First of all, I don't quite see the benefit of studying bag semantics when throughout the article it seems to actually play a role only once. This is in Example 10, where it is shown that in the presence of projection (the only operation studied in the submission that is actually capable of introducing duplicates) Theorem 3 does not hold. All results on completeness are presented for BGPs only, where the solution to a BGP is actually a set, since duplicates cannot occur, given that RDF graphs are sets and thus cannot contain duplicates. To me, it seems as if the use of bag semantics makes things unnecessarily complicated without any major benefit. Wouldn't it be easier to work with set semantics throughout the article, and at the end have some (sub)section discussing the effect of bag semantics? This way, some of the problems stated below could be avoided.

Second, the notation \subseteq for set/bag containment is either not used consistently or in a confusing way. Usually, I would assume that without any further definition, \subseteq expresses set containment, thus ignoring multiplicities. And it is actually used in accordance with my assumption at the beginning of the article (e.g. Section 2, where it is used between RDF graphs, i.e. sets of triples, or to express that some set W \subseteq var(P) [line 197]). I.e., it is used between sets. Then, in Definition 4, it is used between bags (in "[[Q]]_{G'} \subseteq [[Q]]_G"). Thus it does not become clear at this point whether the notion of containment here ignores or includes multiplicities. I think this is actually one out of only two places in the whole article where there is an actual comparison between bags (which gets back to my first comment, on whether dealing with bag semantics throughout the whole article is actually worth it). The second place where there is a comparison between bags is line 913. Here, suddenly \subseteq_s is introduced for "set inclusion". However, I think it is only used once. And already in and after Proposition 10, just \subseteq is used for the containment between sets. Now all of this is not necessarily wrong or actually inconsistent: assuming \subseteq actually means bag inclusion, of course it can also be used between sets, representing the common set inclusion there, and at the only position where set inclusion between bags is needed, \subseteq_s is used. However, if this is the case, it is extremely confusing, and should be clarified and made explicit in the Preliminaries (since this would also clarify what kind of inclusion is intended in Definition 4).
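To make the two readings concrete, here is a minimal sketch (Python, with a hypothetical one-mapping example, not taken from the paper) of how set inclusion and bag inclusion differ on the same pair of solution multisets:

```python
from collections import Counter

# A solution mapping as a hashable tuple of (variable, value) pairs --
# a toy stand-in for SPARQL solution mappings, not the paper's notation.
mu = (("?m", "Germany"),)

bag_G  = Counter({mu: 1})   # [[Q]]_G : mu occurs once
bag_G2 = Counter({mu: 2})   # [[Q]]_G': mu occurs twice

# Set inclusion (ignores multiplicities): compare supports only.
set_incl = set(bag_G2) <= set(bag_G)                       # True

# Bag inclusion (respects multiplicities): every count in bag_G2
# must be at most the corresponding count in bag_G.
bag_incl = all(bag_G[m] >= c for m, c in bag_G2.items())   # False

print(set_incl, bag_incl)   # True False
```

Under set inclusion [[Q]]_{G'} \subseteq [[Q]]_G holds, under bag inclusion it does not, which is exactly the ambiguity in Definition 4.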

Third, I must admit that I do not understand the argument starting on line 432 on why it is sufficient to check only the BGPs for completeness. On the one hand, I do not quite understand what the "therefore" in "we can therefore focus on the BGPs" refers to (it seems to refer to the fact that bag semantics is assumed, and that does not make any sense to me). On the other hand, I do not quite understand how projection can be simply ignored in this case. Let's consider the RDF graphs

G = {(UN member Germany), (Germany lang DE)} and
G' = G \cup {(Germany lang LANG)} [let's assume just any language for the sake of an example]

Consider the completeness statement

Compl((UN member ?m))

and the query

Q=({?m}, {(UN member ?m), (?m lang ?l)})

Now is Q complete over G or not, given that [[Q]]_G = {(?m -> Germany)} and [[Q]]_G' = {(?m -> Germany), (?m -> Germany)}? I don't have an answer to this, neither whether it should be considered to be complete or not (I can find arguments for both decisions), nor whether it is complete according to Definition 4.
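For what it's worth, the duplicate can be reproduced mechanically; the following sketch (a naive bag-semantics BGP evaluator written only for this toy example, not the paper's algorithm) computes [[Q]]_G and [[Q]]_{G'}:

```python
from itertools import product

def match(pattern, triple):
    """Match one triple pattern against one triple; return a binding dict or None."""
    binding = {}
    for p, t in zip(pattern, triple):
        if isinstance(p, str) and p.startswith("?"):
            if binding.get(p, t) != t:
                return None
            binding[p] = t
        elif p != t:
            return None
    return binding

def eval_bgp(bgp, graph, proj):
    """Naive bag-semantics evaluation of a BGP over a graph, projected onto proj."""
    answers = []
    for combo in product(graph, repeat=len(bgp)):
        binding, ok = {}, True
        for pat, tr in zip(bgp, combo):
            b = match(pat, tr)
            if b is None or any(binding.get(v, x) != x for v, x in b.items()):
                ok = False
                break
            binding.update(b)
        if ok:
            answers.append({v: binding[v] for v in proj})
    return answers

G  = [("UN", "member", "Germany"), ("Germany", "lang", "DE")]
G2 = G + [("Germany", "lang", "LANG")]
Q  = [("UN", "member", "?m"), ("?m", "lang", "?l")]

print(eval_bgp(Q, G,  ["?m"]))   # [{'?m': 'Germany'}]
print(eval_bgp(Q, G2, ["?m"]))   # [{'?m': 'Germany'}, {'?m': 'Germany'}] -- duplicate
```

So under bag semantics the mapping indeed occurs twice over G', which is what makes Definition 4 ambiguous here.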

Of course, the whole problem with set or bag semantics can be easily fixed, and is by far not as severe as the long comment on it may suggest.

# Choice of the Query Language

Although the submission claims to study the query language SPARQL, the language defined in the Preliminaries and studied throughout the submission is just some conjunctive fragment of SPARQL extended by some form of negation. However, it is far from being full SPARQL (neither 1.0 nor 1.1), as also acknowledged by the authors in the discussion. First of all, I find it somewhat strange to call this language SPARQL (especially in the Preliminaries, which are written in a way that suggests that the language of SPARQL is defined there). Second, one could argue that, by not covering full SPARQL, the title of the submission promises a little bit too much.

This does not mean that having such a generic conjunctive query language with negation is necessarily bad, but the presentation might need to be adapted a little bit.

# Specifics of RDF and SPARQL

Given this rather simple query language and the different definitions of completeness and soundness, the question arises to what extent these definitions are specific to RDF and SPARQL, or apply to similar languages and settings as well. This works both ways: On the one hand, while it is very well motivated why having e.g. a notion of soundness is a very important tool to bridge the gap between the OWA of RDF and the CWA of SPARQL, it does not become clear what the specifics of RDF and SPARQL are that do not allow simply applying existing results/notions of completeness, mainly from relational databases (the related work section provides good details on how earlier work in the area of the Semantic Web differs from the current submission, but for relational data, this distinction did not become clear to me). Thus, it did not become quite clear to me what the obstacles and challenges are / what has to be taken care of when defining the notions presented in the submission.
On the other hand, it seems to me that these notions and results are also more broadly applicable than just to RDF and SPARQL. Or are there any restrictions? Would the results still hold for arbitrary atoms instead of triple patterns? (Since most of the arguments are based on some mappings/homomorphisms, it at least seems to be the case.) Maybe these opportunities could be discussed in a little bit more detail.

Overall, to me the submission makes more the impression to provide a general framework for completeness and soundness reasoning than to be specific for RDF and SPARQL.

# Proposition 2: "and mapping \mu" (at the very end of Prop. 2): I don't think this is correct, but it should rather be "and solution \mu \in [[P]]_G"?

# Lemma 1:

The statement \tilde{P} \subseteq G is rather confusing (and unnecessarily so, I think). The problem is that \tilde{P} \subseteq G can hold only if P is ground, since otherwise \tilde{P} by definition contains at least one element not contained in G (this fact is even pointed out in the proof of Lemma 1). Thus, \tilde{P} \subseteq G if and only if P \subseteq G. Stating the Lemma in terms of P instead of \tilde{P} would - as far as I checked it - still work, but heavily increase the readability of some proofs, and also heavily increase how intuitive the Lemma is (since stating it in terms of \tilde{P} just makes it more complex). Also, P \subseteq G would actually be in line with the statement in line 697, which states exactly that the saturated BGP (i.e., P in Lemma 1) has to be in the graph.

The same argument then also applies to Theorem 1 and the notion of \tilde{\mu P} \subseteq G. Again, the use of the freezing functions makes the whole statement way more complex to understand. In the end, containment can only hold if P is ground. Again, also in this case I think the proof can be adapted accordingly (but I might be missing something).
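To spell the observation out (assuming, as in the paper, that the freeze mapping \tilde{id} sends every variable to a fresh IRI not occurring in G), the equivalence the argument appeals to is:

```latex
% If some variable ?v occurs in a triple pattern t of P, then the frozen
% triple \tilde{t} contains the fresh IRI \widetilde{?v} \notin terms(G),
% so \tilde{t} \notin G. Hence:
\tilde{P} \subseteq G
  \;\Longrightarrow\; \mathrm{var}(P) = \emptyset
  \;\Longrightarrow\; \tilde{P} = P,
\quad\text{and therefore}\quad
\tilde{P} \subseteq G \;\iff\; P \text{ is ground and } P \subseteq G .
```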

# NRF graph patterns and the complexity of Pattern Soundness:

What is the actual relationship between a graph pattern P, some corresponding NRF graph pattern P' (i.e., P' = NRF(P) if we assume NRF(P) to be a function returning an NRF version of P), and soundness? I think it is the case that P is sound iff P' is sound? Maybe this relationship could be made explicit, and this might help to clarify the situation on the complexity of pattern soundness. In the first paragraph on page 14 it is stated that a polynomial number of completeness checks suffices to compute NRF(P). However, actually computing NRF(P) is a function problem, and looks to be P^NP (while NP is sufficient to check whether some triple pattern can be removed, NP is not powerful enough to decide whether actually all redundant triple patterns have been removed, i.e., that no more triple patterns can be removed). Maybe the authors could elaborate in a little bit more detail on how the NP membership can be established.
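For illustration, a greedy computation of a non-redundant form along these lines might look as follows (a hypothetical sketch; `is_redundant` stands in for one NP completeness/equivalence check and is not an actual function of the paper):

```python
def nrf(pattern, is_redundant):
    """Greedy sketch of computing a non-redundant form NRF(P).

    `is_redundant(triple, current)` is a hypothetical oracle for one NP check:
    can `triple` be dropped from `current` without changing the pattern's
    meaning? The loop makes at most O(|P|^2) oracle calls -- the "polynomial
    number of completeness checks" -- but the overall task is a function
    problem (P^NP-style), not a single NP check, which is the point above.
    """
    current = list(pattern)
    changed = True
    while changed:
        changed = False
        for t in list(current):
            rest = [u for u in current if u != t]
            if rest and is_redundant(t, current):
                current = rest
                changed = True
                break
    return current
```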

# Evaluation of the implementation:

The results actually seem promising, and based on the chosen sources (for both data and queries), the query sizes actually seem to be correct/justified. However, an average query size of 2.6 triple patterns feels rather small nevertheless.

### Typos/Minor Questions:

line 11: I would suggest to use "Knowledge Base" instead of "KB" (especially since I think this is the only occurrence for the next 20 pages) - but since I do not know whether "KB" is usually used instead of "Knowledge Base" without any introduction, I would ask the authors to just take this as a hint and decide based on their experience.

line 150 (p.3): "RDF and SPARQL, and completeness statements": I would suggest to either replace the first "and" by ",", or to replace the second "and" by "as well as".

line 183: (being a little bit picky here) saying that a BGP "consists" of a set of triple patterns is a little bit imprecise, since it's not completely clear what "consists" means. Why not just say "a BGP is a set of triple patterns"?

Preliminaries (Subsection 2.1): While the Preliminaries section introduce all major notions used throughout the article, there are a few minor problems:
- In this section, the notion of an RDF graph is introduced. However, starting with Section 2.2, this term is almost never used; instead, just "graph" is used (an exception is e.g. line 301, where "RDF graph" is used). It would be good if this were announced somewhere at the beginning of the Preliminaries (something along the lines of "in the following, we will just use graphs instead of RDF graphs"). Also, beginning with Section 6, the term "data graph" is used, making for three different terms for the same thing.
- The notions dom() and var() (first used in line 189) are not introduced. While their meaning is of course obvious, they are usually defined nevertheless.
- In line 197/198, the notion of a graph pattern has not yet been defined (definition is only in line 213)
- line 198: "[[Q]]_G is obtained": "the evaluation/solution/result [[Q]]_G is obtained". Also, G is not really defined at this place.
- As already mentioned earlier, the wording/presentation leaves the impression that the query language defined here is actually SPARQL - which is not the case.

line 226: "also negated version" -> "also the negated version"

Definition 2 (Completeness Statements): Is there a reason to define the completeness statements as Compl(P) for a nonempty BGP P, instead of just being P? Of course, in the end it does not make a big difference, but there are several occasions throughout the submission (e.g. several proofs), where having the "wrapper" Compl() just makes notation more difficult, since it is necessary to talk about "the BGP of the completeness statement", instead of just "the completeness statement". Is there any benefit in making this distinction?

line 307/308: "one mapping result": either "one mapping" or "one result" (mapping is probably more accurate)

page 5, second column, image at the top: It takes some time to understand what is actually shown in this image, that is, that the new boxes indicate which completeness statements guarantee which part of this graph to be complete. One problem is that the description of the graph comes a few lines after the graph itself, and then has no proper reference to the graph. Maybe make the graph a figure? This would a) make it possible to provide the most basic information in the caption, and b) allow referencing this graph at the correct position in the text.

line 347: "the (usa, lang, ?l)" -> "the triple pattern (usa, lang, ?l)"

line 364: "no graph can contain more information about Q_3.": This wording here is a little bit misleading: Q_3 is ground and contained in G_cou. So of course no other graph can contain more information on Q_3, since for a ground query we have that it either maps into the RDF graph, or not. Once it maps into G_cou, what more information should a graph contain about this query?

line 390: "that the members are completely different items": what does this mean? (e.g., different from which items/elements?)

line 508: at the word "statements": there seems to be a rogue "}" between "e" and "m"

line 527: My understanding is that \tilde{c} represents an arbitrary value. However, the notation \tilde{} was introduced and used for the frozen values of a variable. While the frozen values are a convenient way to argue about arbitrary values, there is a difference between the frozen value and an arbitrary value. Thus I suggest to use a different notation for the arbitrary IRI here.

Section 3.3: Why not use a formal definition block/environment for the definition of query soundness - like it was done for completeness? Would be consistent with the previous section and would probably make the definition easier to read.

line 623 (approx.): "while the rest to the graph": this sentence should be checked. Maybe something like "while the remaining parts are mapped to the graph"?

line 660: On first reading, the paragraph starting at this line was rather confusing, since it is not quite clear what "repeated application", and thus termination, really means in this situation. As a result, the reason why the 2nd case should lead to termination (and not just return the same result again and again) does not become clear either. It makes more sense once one has read the formal parts on the next pages, but this paragraph does not really help in preparing for them. Maybe this could be clarified a little bit.

Table 1 (page 12): Why is "complete" abbreviated? There would be enough space for "complete".

line 879: "in this subsection": replace by "in this article", since the class of graph patterns was defined earlier and not in this subsection.

line 757: "In our algorithm": Strictly speaking, this algorithm has never been described. There is only Theorem 1, and from this a corresponding algorithm can be easily derived. However, since this is not done in the submission, maybe the wording here can be adapted accordingly.

Proposition 7 (but in general for the complexity results): It is slightly more common to not include the fixed parts of the input in the list of given items, but to precede them. E.g., "Let P be a BGP. Given a set C of completeness statements and an RDF graph G, deciding .... is NP-hard. The complexity remains the same even when G is also fixed". Of course, this is a matter of taste, but it makes it easier to grasp which parts are actually fixed.

line 880: "be monotonic": "is monotonic"

line 965: Is it necessary to introduce "\sqsubseteq_s" for query containment under set semantics, or could \subseteq_s be used as well (usually, in the presence of set semantics, \subseteq is used for query containment)? But this of course is a matter of taste.

line 1001: Does this hold for arbitrary graph patterns or only NRF patterns?

line 1016: "or where the projection is over all variables in the positive part of the query body":
what is the benefit of allowing projection on variables from the negative part? By definition, they are never mapped to any value, so they are not in the domain of any solution. Of course, because of exactly this argument, it does not hurt to allow them to be contained in the projection. However, mentioning this property there, even more so without explanation, is quite a distraction from the actual point made there.

second line of example 10: "it is impossible to right-shift any triple": personally, I think "there is some triple that cannot be right-shifted" is the much clearer and easier to understand sentence.

line 1050: "One can show that P is complete if and only if the positive part P+ is complete": Is this hard to show, or as obvious as it seems to be at first glance (in this case, maybe the statement could be adapted accordingly; otherwise a hint on why it is true would be helpful). Why not make this a formal result?

line 1058/1059: Teasing the reader that having a shared variable is important, but not saying anything more about this is ... a little bit cruel?

Equation (2): Strictly speaking, C_T has not been defined. Only C_\tau for \tau \in T was defined previously. However, one can of course argue that the definition via \bigcup_{\tau \in T} C_\tau is immediate.

page 17ff: Already earlier, starting with Section 6, there seems to be a slight change in the style in which the paper is presented. Especially on page 17, it is no longer the "formal result, then proof" style; instead, some (informal) arguments are provided first, and then the result is stated. In the case of the first equation on page 17 and its equivalence with [[P_\tau]]_{...}, this is even done without providing any formal result at all. Proposition 11 is stated as a formal result, but instead of a proof, the arguments are made right before the Proposition. Maybe these results could be presented more in line with the previous results.

page 17, 2nd column: Why is the replacement of variables restricted to non-predicate values only (this occurs twice on this page)?

page 18, 2nd column: Is the way of creating completeness statements as described here actually helpful in approximating the size of real-world completeness statements? The authors report on several completeness statements in natural language for different data sources. How big are the formal completeness statements expressing these? Also, in the test setting, as a result of how they were created, the size of the completeness statements is very similar to the size of the queries. Is this a realistic scenario (in the examples throughout the article, queries rather had a size two or three times that of the completeness statements)? Maybe the authors can comment on this.

Table 2: While the runtime for the first 6 lines is somewhat stable, it is significantly higher in the last line. Do the authors have any information on whether this is an outlier or the start of a significant increase in runtime compared to the size of the query?

References (page 25ff) are inconsistent, especially for conference publications. Some are very minimal (e.g. [1]), others are very detailed (e.g. [9]).

## Appendix:

Why does each proof get its own appendix? This makes the structure rather unclear and results harder to find. Why not something like one appendix per section?

line 2146 (page 28): I think [[(P,\emptyset)]] should be [[(P,\mu_{\emptyset})]] ?

line 2186: "It is trivial to see that P is ground": "that then P is ground"

line 2189: Maybe the inclusion [[P]]_G' \subseteq [[P]]_G could be clarified by stating in addition that [[P]]_G' = [[P]]_G = {\mu_\emptyset}.

line 2205: I think (P,\mu_{\emptyset}) should be {(P,\mu_{\emptyset})} ?

line 2207: incomparable to what/wrt. what?

line 2211: By the first claim: By the second claim

line 2215: the second claim: the first claim

page 29, last line of second column: var(p_i): The notation var() is used throughout the article to denote the set of variables in "something", for example a BGP. While I agree that the overloading of this term here is obvious and does not really lead to confusion, maybe not overloading the operator is still advantageous.

page 30, second line of first column: "?\neg p_i": While of course this can be used as a variable name, it looks very strange, and I have so far never encountered this notation. Usually, \bar{p}_i or similar notations are used, since this in addition helps to distinguish between the propositional variables p_i and the variables in the query. Of course, it is not wrong, but it looks very unfamiliar.

line 2257/2258: It would really improve the readability of the proof if, instead of repeating the definition of cruc_{C,G}(P) here, the actual value of it would be provided.

line 2324: Why is there such an emphasis on "directed" graphs? Is there any difference between the 3-col problem for directed and undirected graphs that I am not aware of? And if not, I think it would be more natural and less confusing to use the usual 3-col problem, which, to my understanding, is commonly defined over undirected graphs (also, I don't see in the proofs where the direction of the edges is of importance). In case the directed version is of importance, maybe the authors could discuss this briefly (since, as I mentioned, I did not get it).

line 2325: (rather picky) If there is no citation for the complexity of QSAT given, I would say no citation for 3-col is needed as well (especially if the undirected version is used).

lines 2345ff: the proof proceeds a little bit quickly here. Maybe some additional details could be provided. For example, the BGP P_{col} is ground. Thus, how does the frozen version of P_{col} differ from P_{col} itself?

line 2388 (or 2387): (again a little picky) The 'b' in "(a,b,c)" probably is not intended to be the same 'b' as the one for blue in e.g. "(r, edge, b)". It's not a real problem since they never interfere, but it would be nicer to use different names.

around line 2395: It is "3-uncolorable" in the statement of the claim, and "3-incolorable" at the beginning of the proof of the claim (inconsistent).

line 2400: The fact that the cruc operator returns exactly triples(G_p) is actually independent of the assumption that G_p is not 3-colorable. However, being the first statement after the assumption left me a little bit puzzled since I tried to make the connection to the assumption. Actually, the same argument is also used in the other direction of the proof. Maybe it would be more intuitive to follow for the reader, if this fact was stated before the "Assume G_p is not ...". I.e., right after the first sentence, something like "First, observe ...".

Review #2
By Martín Ugarte submitted on 07/Feb/2018
Suggestion:
Major Revision
Review Comment:

Review for Ensuring the Completeness and Soundness of SPARQL Queries.

The paper discusses how to verify the properties of "completeness" and "soundness" of SPARQL query answers under so-called completeness statements. It builds on top of a series of notions and results previously published by the authors. After inspecting the previous work, I believe the paper is a fair contribution, especially considering the work of putting all results together in a coherent story. However, and although the results are interesting, I did not find the paper easy to read. The notation is sometimes overly complicated, some notions are not formalized, and the authors present some involved concepts without providing intuitions. I do think, however, that the results are relevant to the Semantic Web community. In its current form I don't think the paper is ready to be published, but I would agree with publication after a major revision (if the other reviewers agree, of course). Next I explain the reasons for this score, from the general comments to the more particular details. The general comments are the most important to address for the paper to be published.

########################
## 1. GENERAL COMMENTS: ##
########################

1.1) It has been discussed many times that the OWA in SPARQL corresponds to weak monotonicity and not to monotonicity. There is a vast literature on this that the authors do not mention. I understand that under the restricted class of queries that they define the two notions coincide because mappings cannot be extended (there are no disjunctions or left-outer joins), but this should be at least discussed. There are parts of the paper where disjunctions of results are considered, and therefore the discussion is not general enough for talking about the OWA. This is particularly misleading in the introduction, where there is no mention of the class of queries under consideration.
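For reference, the two notions stated in the standard way (these definitions are from the literature, not from the submission):

```latex
% Monotonicity: every solution survives unchanged when the graph grows.
G \subseteq G' \;\Longrightarrow\; [\![Q]\!]_G \subseteq [\![Q]\!]_{G'}

% Weak monotonicity: every solution survives, but possibly *extended* to
% further variables (as can happen with OPTIONAL); \mu' \succeq \mu means
% that \mu' extends \mu.
G \subseteq G' \;\Longrightarrow\;
  \forall \mu \in [\![Q]\!]_G \;\exists \mu' \in [\![Q]\!]_{G'} : \mu' \succeq \mu
```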

1.2) The authors define a very restricted class of queries but do not mention this. The class is presented simply as "SPARQL with Negation", but this is far from the general class of SPARQL queries with negation. In fact, the class is basically equivalent to UCQs with negation under a fixed FO signature. The problem of answering UCQs under a similar setting has been studied before by some of the authors, and is closely related to the problems treated in this paper. By using the proper relational signature it is clear that some of the results could carry over. For example, the completeness statements presented in this paper are equivalent to the table completeness statements defined in [Razniewski & Nutt, 2011], and the notion of "extension pair" presented here is equivalent to that of a "partial database". Also, the complexity results in the two papers are similar. This should be discussed in depth, and a full comparison with the relational setting should be presented. Currently the references are only mentioned very briefly in the Related Work section.

1.3) The authors claim to use bag semantics, but the definitions and notation throughout the paper correspond to set semantics. The same occurs with most proofs in the appendices. The authors take the formalization based on sets of mappings by Pérez et al. (which should be referenced). Although set semantics are equivalent to bag semantics over BGPs, there are situations in which the queries under evaluation are more involved (for example, sets of partially mapped BGPs). It is not clear whether the authors are using sets or bags in those cases, nor whether the results and proofs are correct for bag semantics (see for example comment 2.18). The authors should present a proper formalization based on multisets of mappings, but currently there is a big mismatch between the claim of using bag semantics and some of the results and definitions in the paper.

1.4) The paper is not easy to follow. Although the initial examples are very good and clear, from Section 4 to Section 6 there are several problems, and these sections took me much more time than necessary to understand. For example, the definition of the "crucial part" of a pattern P w.r.t. a graph G and a set of completeness statements C (line 623) is presented as \tilde{id}^{-1}(T_C(\tilde{P} \cup G)). This is defined without any intuitions and, moreover, while \tilde{id} is a mapping, T_C returns graphs. Therefore \tilde{id}^{-1}(T_C(...)) is not well defined. Although it can be deduced after some time that the idea is to transform a graph into a BGP, the authors should say this and mention that they are abusing notation. Also, \tilde{id}, \tilde{P} and T_C are defined in-line some pages before without context, so the reader will not have the definitions in mind and will not be able to find them easily. This is a particular example, but it represents the fact that Sections 4, 5 and 6 are hard to follow. I give more examples in the Particular Comments below.
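As far as it can be reconstructed from the definition quoted above, the intended reading of the operator, with the abuse of notation made explicit, seems to be:

```latex
% T_C applied to the graph \tilde{P} \cup G returns a graph; the inverse
% freeze \tilde{id}^{-1} is then applied triple-wise, turning each frozen
% IRI \widetilde{?v} back into the variable ?v and leaving all other terms
% unchanged -- i.e., it silently converts the resulting graph back into a BGP.
\mathrm{cruc}_{C,G}(P) \;=\; \tilde{id}^{-1}\bigl(\,T_C(\tilde{P} \cup G)\,\bigr)
```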

1.5) Section 6 appears to have been written without enough time. Although the techniques might be relevant, they are presented quickly, without much intuition or discussion, and it is hard to follow all the ideas and to see why they could be good in practice. Furthermore, the techniques are heuristics rather than optimizations. There are almost no performance guarantees; some of the constructions involve exponential factors, and there is little discussion about how the involved trade-offs can impact performance. Either a more detailed discussion should be presented, with clear performance guarantees, or this should be discussed as a set of practical considerations. In its current form, the involved notation and formalization of operators only makes the paper more difficult to read.

1.6) As presented, the experimental evaluation in Section 7 does not have a clear objective. One can find repeated occurrences of statements like "which is reasonably fast", "which is relatively slow", "can still be done relatively fast", but there is no comparison to other approaches (so there is no relative scale) nor a defined notion of what is "reasonable". This should be clearly stated, especially because the class of queries under consideration is very restricted. The Related Work Section cites several systems that have been proposed before, and it is not clear why the authors do not compare against any of them. The only mentioned objective is to see if the techniques presented in Section 6 can provide a speed-up, but this will probably be the case if one compares against the brute-force computation of all equivalent partial groundings. In the comments below I only include details regarding the experimental setup, as I think the complete section needs a major refactoring.

###########################
## 2. PARTICULAR COMMENTS: ##
###########################

2.1) L 100: Do you refer to a syntactic form of negation? The introduction of negation by means of OPTIONAL should be mentioned here, as well as weak monotonicity.

2.2) L 146: What does it mean to give a "partial characterization"? The reference [8] is only cited here, and it seems to be very relevant for assessing how much of this paper is novel. This should be discussed in detail.

2.3) L 190-194: "the" freeze mapping should be "a" freeze mapping, as there are infinitely many ways of constructing it (by choosing different sets of IRIs). The same for "the" prototypical graph. Also, \tilde{id} should be parametrized by P or var(P); as it stands, the notation is confusing, especially when defining the "crucial part" in line 623. These definitions could also be in a definition environment, as they become important elements later and currently are hard to come back to and find.

2.4) L 197-203: As presented, it seems that you are using a mix of bag and set semantics, because the BGPs return sets but projections can produce bags. Again, because of the limited class of queries under consideration, bag semantics and set semantics coincide, but the formalization is based on sets and it works only because, for this class of queries, set and bag semantics are equivalent.

2.5) L 279-282: For clarity, it would be good to discuss why you need a set of completeness statements instead of a single statement.

2.6) L 405: The definition seems to be unnecessary and only complicates notation. In particular, it would suffice for Definition 5 to finish with "it holds that [[ Q ]]_{G'} \subseteq [[ Q ]]_G" instead of "it holds that (G,G') \models Compl(Q)".

2.7) L 405: Despite the previous point, Definition 4 is insufficient in terms of notation because of bag semantics. The problem is that saying that the result of Q evaluated over G' "also appears in Q over G" is ambiguous. Should each mapping appear with the same multiplicity? Or is it enough that a mapping appears, irrespective of its multiplicity? Or must one multiplicity be greater than the other? The fact that this is presented with subset notation is confusing. It should be very clear what the nature of the elements in [[ Q ]]_G is.

2.8) L 432: Related to the previous point, it is not clear what you mean when saying "we can focus on the BGPs used in the body of queries for completeness statements".

2.9) L 445: Proposition 1 seems to be a more complicated way of stating the same thing, and has no application in the paper. It is only mentioned once more, to say that it does not provide a concrete way to compute a set of mappings to check for completeness. Therefore it does not really provide a useful characterization; I don't think it should be in the paper.

2.10) L 551 & 568: The definition of soundness entailment for an extension pair is again unnecessary; it can be directly defined over graphs and a completeness statement (both for mappings and BGPs).

2.11) L 578: Proposition 2 is the same as the definition but replacing the answer set with "all mappings"; it does not seem to be a useful fact. It is only mentioned once more, in Line 1591, where you could replace "by Proposition 2" with "by definition".

2.12) L 596: When you say "Furthermore, we define the evaluation of a set of partially mapped BGPs over a graph G as the union of evaluating each of them over G", do you mean the set union or the bag union? This is important, because in the rest of the paper you are considering that the answers to BGPs are sets (as discussed before). Also, since in a partially mapped BGP (P,m) you require that var(P) and var(m) are disjoint, this property still holds. However, there is no restriction preventing two different partially mapped BGPs from producing the same answers.

2.13) L 618: Again, it is not clear if the extension to sets of partially mapped BGPs is under set or bag semantics.

2.14) L 620-625: From Example 3 to Example 4 (included), the provided intuitions are very hard to follow. I think this part requires a complete example, showing a completeness statement that is mapped to a BGP and to a graph. In particular, it is hard to see why you need the graph G, which, if I understand correctly, is required only because the completeness statement could join parts of the query with parts of the graph. Also, it should be explained what the meaning of \tilde{id}^{-1}(...) is; as mentioned in point 1.4) of the General Comments, this is an abuse of notation that makes reading the definition very hard. I really think that this part is highly relevant and is not explained with enough detail and intuition. Especially because the epg operator is based on this.

2.15) L 680: The three cases presented again lack an intuitive explanation. It is hard to understand what is going on with the underlying partially mapped BGPs without having a complete example at hand. Example 4 is hard to read because one needs to go back and forth in the paper looking for the different elements, which again are not easy to find. Section 4.1 would benefit from a fresh example that explains step by step the repeated applications of the epg operator.

2.16) L 728: The first property of Proposition 4 is basically Proposition 3.

2.17) L 800-825: The fact that a problem is complete for a complexity class for a particular fixed graph does not imply that the problem is complete for that class for every fixed graph. This occurs in Propositions 6, 7 and 8. For example, for the empty graph the entailment problem is not \Pi_2^P-complete; it is in fact trivial. This should be stated correctly or made more precise, perhaps by studying the problems under parameterized complexity.

2.18) L 934: Proposition 10 seems to be incorrect under bag semantics. Take for example C = {Compl((?x, a, b))}, P = (?x, a, b), P' = (?y, c, d), and G = {(1, a, b), (2, c, d), (3, c, d)}. The mapping {x -> 1} occurs in [[ (var(P), P \cup P') ]]_G with multiplicity 2, while it appears in [[ P ]]_G with multiplicity 1, and therefore the "multiset subset" relation [[ (var(P), P \cup P') ]]_G \subseteq [[ P ]]_G does not hold. However, it is clear that \tilde{P} \subseteq T_C(\tilde{P} \cup \tilde{P'}). This shows the importance of a formal model for bag semantics.
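The multiplicities can be checked mechanically; a minimal self-contained sketch (naive bag-semantics evaluation, written just for this counterexample):

```python
from itertools import product

def eval_bgp(bgp, graph, proj):
    """Naive bag-semantics BGP evaluation, projected onto proj; the returned
    list is the bag of answers (duplicates preserved)."""
    out = []
    for combo in product(graph, repeat=len(bgp)):
        b, ok = {}, True
        for pat, tr in zip(bgp, combo):
            for x, t in zip(pat, tr):
                if isinstance(x, str) and x.startswith("?"):
                    ok = ok and b.setdefault(x, t) == t
                else:
                    ok = ok and x == t
        if ok:
            out.append({v: b[v] for v in proj})
    return out

P, P2 = [("?x", "a", "b")], [("?y", "c", "d")]
G = [(1, "a", "b"), (2, "c", "d"), (3, "c", "d")]

print(eval_bgp(P + P2, G, ["?x"]))  # [{'?x': 1}, {'?x': 1}] -- multiplicity 2
print(eval_bgp(P, G, ["?x"]))       # [{'?x': 1}]            -- multiplicity 1
```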

2.19) L 943: Lemma 2 suffers from the same problem as the previous point.

2.20) L 979: The negative part of a BGP is simply a CQ, and what you are doing with the non-redundant form is reducing the CQ to its homomorphic core. There are well-known algorithms for computing this, as well as results regarding complexity that should be pointed out.

2.21) L 1054-1059: The statement about the extension to MINUS is not immediate. The semantic difference between MINUS and NOT-EXISTS is very peculiar. For BGPs they are almost equivalent, except for the case in which the positive and the negative part do not share variables. It might be true that the results hold, but this should either be treated carefully or simply removed.

2.22) L 1084: The "Data-agnostic Completeness Checking" subsection is a simple heuristic. There are no guarantees that the presented hashmap-based technique will perform well. Apart from hashing each set of terms, for each NOT-EXISTS pattern P', one needs to do 2^n lookups, where n is the number of terms occurring either in P' or in the positive BGP. Depending on the number of completeness statements and the size of the pattern this might be a good idea, but I think it should just be mentioned as a heuristic in Section 7 rather than as an optimization technique.
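A sketch of the kind of subset lookup described above (all names hypothetical; the paper's actual data structures may differ), which makes the 2^n factor explicit:

```python
from itertools import combinations

def index_statements(statements):
    """Hash each completeness statement under the frozenset of its constant terms."""
    index = {}
    for stmt in statements:
        key = frozenset(t for t in stmt if not t.startswith("?"))
        index.setdefault(key, []).append(stmt)
    return index

def candidate_statements(index, terms):
    """Look up every subset of `terms`: 2^len(terms) lookups, hence the 2^n factor."""
    hits = []
    for r in range(len(terms) + 1):
        for subset in combinations(sorted(terms), r):
            hits.extend(index.get(frozenset(subset), []))
    return hits

# Hypothetical toy usage:
stmts = [("UN", "member", "?m"), ("?c", "lang", "?l")]
idx = index_statements(stmts)
print(candidate_statements(idx, {"UN", "member", "lang"}))
```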

2.23) L 1196: Example 2 tries to illustrate prioritized evaluation, so the use of a template only makes things more complicated and does not add to the intuition. Also, over the presented graph G_{org} the template (C, {?org}, \Omega) is equivalent to C, which makes things confusing. You could just evaluate ((?c, lang, ?l), (org2, founder, ?c)) instead of going through the definition of P_{\tau_{org}}. If the idea is also to illustrate how P_\tau is constructed then it should be presented.

2.24) L 1300: The term "base" is confusing. You say that the set of BGPs acted as a base, but then you do things for each "base". Why not just talk about a set of BGPs, and what you did with each BGP?

2.25) L 1301-1321: The presented information is not enough to get a good picture of the possible bottlenecks in the performed experiments. For example, whether you are using an SSD, or what the size of the Wikidata graph is (to see how much of it fits in main memory), are factors that highly affect the experimental evaluation. Also, the way in which queries and statements are constructed is arbitrary and unclear. For example, I cannot find a precise meaning for "we generated queries by instantiating the query bases with these projected mappings".

#########################
## OTHER DETAILS & TYPOS: ##
#########################

3.1) L 61: "..., that is, the checking whether..." -> "..., that is, checking whether..."

3.2) L 62: To say "Nonetheless, their approach is limited..." when that approach was presented by authors of the current paper sounds strange.

3.3) L 100: "..., answers of queries with negation..." -> "..., answers to queries with negation...".

3.4) L 153: "In Section 4, we introduce formal notions, and then...". What is the purpose of these formal notions? You introduce formal notions throughout the paper.

3.5) L 172: It should be mentioned that you are adopting the formalisms by Pérez et al.

3.6) L 221: Why is this notion of "consistency" relevant?

3.7) L 283: The transfer operator could be in a definition environment, it is important afterwards and currently hard to find. Maybe it could be defined later, before it is used.

3.8) L 687: It is confusing to use \tilde{P} here, because its definition uses fresh IRIs which should not occur in G. This could be restated.

3.9) L 728: The term "sat(P,C,G)" should be defined in the text or at least referenced; it is not immediate to find it as the name of the algorithm.

3.10) L 879: "We only need that the positive part of the pattern be monotonic." -> "We only need the positive part of the pattern to be monotonic."

3.11) L 884: "..., the check whether an answer is sound..." -> "..., the check of whether an answer is sound..."

3.12) L 1001: "...the check whether..." -> "...the check of whether..." or "...to check whether..."

3.13) L 1025: "...when evaluated of a graph G." -> "...when evaluated over a graph G."

3.14) L 1085: "...to the (data-agnostic) check whether..." -> "...to the (data-agnostic) check of whether..."

3.15) L 1108: "...are potentially much fewer than..." -> "...potentially much less than..."

3.16) L 1237-1252: I don't think the operator \sigma requires that amount of space, you can just mention that it replaces each variable with _var in BGPs, and give the small example.

3.17) L 1320: "...with Intel Core i5..." -> "...with an Intel Core i5..."

3.18) L 1325: The query "length" should be defined before mentioning it.

Review #3
Anonymous submitted on 19/Mar/2018
Suggestion:
Accept
Review Comment:

This manuscript introduces a conceptual framework for identifying the completeness and the soundness of SPARQL query results based on so-called completeness statements, that is, metadata that indicates which parts of an RDF graph are complete regarding the information captured by the graph. In more detail, the contributions of the manuscript are: a formal foundation of the framework, complexity results regarding the main decision problems in the framework, a baseline algorithm to determine completeness (and soundness) in the context of the framework, optimization techniques to improve the algorithm in terms of performance, and an experimental evaluation that demonstrates the effectiveness of the optimization techniques.

The problem addressed in this manuscript is highly relevant to the Semantic Web community and the proposed framework provides an elegant approach to address this problem. The foundational parts of the manuscript cover all relevant aspects of the approach, and I did not find any flaw in the formalization or the results. The proposed optimization techniques are perhaps not too surprising, but they are effective (as demonstrated in the evaluation) and, thus, a welcome and important addition to the presented framework. The experimental evaluation is reasonable for this type of work and it verifies the authors' claims regarding their optimization techniques; additionally, it provides the reader with a good understanding of the run-time characteristics that completeness checking and soundness checking may have in practice. Taking all these points together, the manuscript presents a great and comprehensive piece of work that combines a solid theoretical foundation with an empirical study of the proposed optimization techniques.

Besides the work presented in the manuscript, the manuscript itself is also of high quality. Although, given the nature of the presented work, the manuscript is anything but light reading and requires some effort from the reader to digest, the authors did a great job guiding the reader through the material; that is, formal definitions are accompanied by good examples, and the intuition behind the various concepts is introduced informally in an easy-to-understand manner. Moreover, the organization of the manuscript is logical and predictable (in a good sense), which makes understanding the framework and the presented work easy.

I only have a few minor suggestions on how the manuscript can be improved (see below), and these points are easily implemented when preparing the camera-ready version of the manuscript. Therefore, I propose to accept the manuscript for publication in the journal.

My suggested improvements are:

* introduce var(P) in Section 2.1

* In the sentence before Example 1, it should be "satisfiable" instead of "consistent."

* The last paragraph of Section 2.2 should be reorganized as follows: The sentence that describes the intuition of the transfer operator (i.e., the third sentence) should come before the sentence that defines this operator.

* Related to the previous point, I would like to see an example of the transfer operator before the last sentence of the paragraph.

* For defining the notions in Section 3.3, use definition environments as has been done for the corresponding completeness-related notions in Section 3.1.2 (i.e., like Definition 4 and Definition 5).