On minimal intermediate results in zero-knowledge federation

Tracking #: 1476-2688

Jonas Halvorsen
Audun Stolpe

Responsible editor: 
Aidan Hogan

Submission type: 
Full Paper
This paper is concerned with limiting the size of intermediate results in zero-knowledge SPARQL federation, in order to lighten the load on local joins and remote semi-joins. The main contribution is threefold: first, an abstract characterization of pruning operations in general is given. It consists in a set of algebraic laws that are deemed to constitute intuitively plausible requirements. Secondly, instances of pruning operations are defined that, if a result-set is viewed as a table, reduces the number of rows and the number of columns respectively. There two principal operations in question are shown to yield row-minimal and column-minimal intermediate results, where minimality is measured wrt. the equivalence of the root of an evaluation tree. Finally, we provide a negative results in the form of an impossibility theorem. It shows that no pruning operation yields results that are both row-minimal and column-minimal. It may be inferred from this that overall minimality is not as simple to obtain as to apply a column-minimal pruning operation to the outputs of a row-minimal one or vice versa.
Full PDF Version: 


Solicited Reviews:
Click to Expand/Collapse
Review #1
Anonymous submitted on 02/Dec/2016
Major Revision
Review Comment:

The paper considers the issue of blank nodes in federating SPARQL queries.

First of all, it is not very clear what is the language (fragment of SPARQL) considered in the paper.
Indeed, in Section 3 it is first said that the select-project-join fragment is studied, which is the fragment of basic graph patterns (BGPs). Then, BGPs are defined as non-empty sets of triple patterns from (ILV) x (IV) x (ILV) (in fact, a more complex recursive definition is given for BGPs, but the reason for this in not very clear, because it is equivalent to the one I give). Anyway, I’m quite sure that projection is not expressible by any semantics of such BGPs I’m aware of, including the obscure semantics given in this paper (see more about the semantics below). Usually, projection is expressed in SPARQL by means of SELECT/CONSTRUCT/ASK clauses; an alternative, more restricted way is blank nodes in BGPs; though, none of these is mentioned in the paper. In Section 3.1 the semantics is intended to be described. First, it is said that the set of answers to a BGP P over an RDF graph G is the set of solution mappings from the set of variables to terms that map P to G. This is not very formal, but understandable. But, suddenly, a join operation is also defined on answer sets, which does not have any counter-part in the syntax of the language. Moreover, then some notion of ‘concept’ comes into the game, and three very confusing definitions are given. According to Definition 3.1 a context c is a bijection from ILB to ILB_c, but in Definition 3.2 it is applied explicitly only to B. Why is it defined for IL then? Definition 3.3 is just nonsense — first it introduces \mathcal G, which is however never used, and then explicitly defines the same thing [P]^c_G in two different ways! I’m really puzzled here. In general, I have an impression that these contexts correspond to different RDF graphs (however, it is never said), and then different parts of the BGP may be evaluated over these graphs. This clearly contradicts SPARQL specification, where a BGP must be evaluated over a single graph (there are other mechanisms to evaluate different BGPs over different graphs and endpoints, namely, GRAPH and SERVICE constructs; confusingly, the last is called federation).

So, the language considered in the paper is not properly specified and does not align with the SPARQL specification. I understand that the language is introduced in the previous work [17] and is not the topic of this paper per se. But I do not consider such an approach acceptable, especially for a journal paper, where the space limitations are not that strict: such a publication would be essentially useless for any reader. Even worse, the mismatch with the SPARQL specification is (almost) not mentioned — I only found a hint in the introduction, where it is said that a generalised SPARQL semantics is given in [17].

First, I wanted to stop reading here, with a recommendation to resubmit with properly formalised settings. But then I’ve decided to read a little bit more. I’ve found, that at least few more pages of the paper is written in the same very imprecise and confusing manner, which makes it almost not possible to understand. I list some examples below.

Theorem 3.1 looks more like a definition, not a theorem: a notion of entailment is not given anywhere, but used here. Also, it has neither a proof nor a citation where a proof can be found.

Theorem 3.2 mentions ‘standardised apart’ graphs, which are never defined anywhere.

The last sentence in page 5 refers to blank nodes in triple patterns, which does not make any sense by the definition.

Theorem 3.4 is obviously wrong. Indeed, let I = {1}, G_1 = {(a, a, a), (c, c, c)} (I hope this graph alone is ‘standardised apart’, even if I do not know what does it mean), let P = {(?x, a, ?x), (?y, c, ?y)}, and let \mu_g = [?x -> a, ?y -> c] (there are no blank nodes here, so g does not play any role, same as c_i). Then, by definition, f(\mu_g, P) = P_1, P_2, where P_1 = {(?x, a, ?x)} and P_2 = {(?y, c, ?y)}. Therefore, there is no any subset of I with the required properties, simply because any subset of I has at most 1 element, while the decomposition requires at least 2, one for each of P_1 and P_2.

The first sentence of Section 4 says that ‘talk’ and ‘binary trees’ are indistinguishable from now. Just nonsense. Actually, the same as the whole paragraph, where expressions, trees, nodes, sets of mappings are just mixed up.

Definition 4.1 talks about trees with labelled nodes. But, suddenly, in the second alternative, it compares a label, \Omega^k, with a node (leaf), [P]^…_… It is a very strange tree, if it can have nodes and labels of nodes in the same domain. I really require to write a definition of these trees.

Definition 4.2 looks overcomplicated: it first defines t[\mu], then defines \mu(t) = t[mu], and then uses \mu(t). Why to introduce t[\mu] at all, why not to define \mu(t) from the beginning?

Lemma 4.1 has a type mismatch: \equiv^P is defined only for [complete] evaluations of P, not for arbitrary subsets of these evaluations, which may be the results of o(\Omega^i). Of course, \equiv^P may be generalised to such subsets and I may think that it is implicit. But then the lemma is weird:
it talks about a row operator o, and says that if two sets of answers, which are results of o, are equal, then they are equivalent according to \equiv^P. However, the operator o seems to be irrelevant here: the equality trivially implies the equivalence for any sets of answers to P just by definition (as it should be for any reasonable equivalence), and I do not think any proof is required here.

Here I stopped definitely.

My conclusion is that the paper may contain some interesting ideas, but in the current conditions this is not possible to understand and judge.
To be accepted, the paper should be completely rewritten in more technically accurate way: for example, the studied language should be formally defined syntactically and semantically, and placed in the state-of-the-art landscape of languages for RDF (in particular, compared with normative SPARQL~1.1); also, the definitions and statements should be consistently formulated and proved in terms of previous definitions.

Review #2
Anonymous submitted on 29/Dec/2016
Major Revision
Review Comment:

In this paper the authors study the evaluation of federated conjunctive queries (CQs)
over RDF datasets containing blank nodes. The main objective of the study is to
determine the minimal amount of information that needs to be transmitted in order
to correctly answer the query in this setting. To this end, the authors propose
several methods for removing results of different triple patterns forming the query,
or by reducing them in case some variables they bring will not form part of the output.

Overall, the setting of the paper is quite interesting, and the main problem that is
tackled is relevant for the Semantic Web community. The results, although mostly straightforward, do shed light on how to minimize the amount of information that needs to be transmitted, and the operators the authors propose for this task could prove useful in practice when one is dealing with blank nodes over distributed RDF sources. However, the presentation of the material has a lot to be desired, and in its present state the paper is simply not ready for publication.

The main complaint I have about the paper is regarding its lack of mathematical rigidity,
which is crucial here, considering that the main aim of the manuscript is to give a formal framework for dealing with blank nodes under federation and *prove* how various operations can make CQ evaluation efficient in this setting. Probably the biggest issue here is that many terms are left undefined, and the reader has to either look them up in the previous literature,
or figure them out from the proofs. At times it almost seems as if the authors assume the reader to be familiar with all the terminology in this setting, which I find rather unreasonable for a journal publication that is meant to be self-contained. The best illustration of this is probably the fact that it is never formally defined what it means to answer a CQ in the federated setting (see more detailed comments about Sec 3 below).

Next, I have some reservation about certain proofs. And while the validity of the results whose
proofs are in question is easy to verify, I find it somewhat troubling that the authors do not seem to be aware of the result they are trying to prove (see e.g. comments about Lemma 6.6).

Finally, the lack of examples is quite evident in the first part of the paper when the terminology is introduced, thus requiring the reader to go over the text several times in order to get the idea of what is going on. I believe that adding examples would greatly improve the readability of the manuscript and significantly improve the presentation of the material.

In conclusion, in its present state, the paper is simply not rigorous enough to be accepted for publication. Missing definitions, incomplete proofs, and unclear notation being the main detractors here, with the rather poor readability coming at a close second. However, one has to note that the results of the manuscript are indeed quite interesting and might provide useful in practice. Therefore, should the aforementioned issues be fixed, I will be glad to give the paper another careful read. My recommendation is to revise and resubmit the manuscript. More specific comments follow.

Main issues that need to be fixed:
1. The context of CQ evaluation under federation is never formally defined. In particular, I found it difficult to understand at first if all the patterns are to be evaluated at all the sources, if the user predetermines which pattern is evaluated where, etc. For a theoretical paper these notions should be defined formally and illustrated by an example. I believe that this was in fact done before in the paper "Semantics and optimization of the SPARQL 1.1 federation extension" (Aranda, Arenas, Corcho), and the extension to the notation of this manuscript should be rather straightforward. I believe that this will be much more readable than the explanation in Section 3.3, which I had to read several times before understanding what the setting studied in the paper actually is. (In fact, the best explanation comes at the very end in Section 10.)

2. Many concepts are not defined well, or are assumed to be known from the previous literature. For instance, homomorphisms between RDF graphs, what it means to be standardized apart, the fact that all blanks are mapped to the same term (Examples 6.3 and 7.3), all of which makes it difficult to follow some proofs. The concepts which are used in the paper should be defined formally.

3. Lack of examples. As discussed above, this is quite problematic, as it makes it difficult to follow the text. Starting form the introduction, examples should be introduced to either illustrate the point authors are trying to make, or to help understanding formal definitions. The previous paper of the authors' in this same journal seems to do a much better job at this. Since the lack of space is not an issue here, this should be fixed, even if repetition is required.

4. Problems with some proofs. These need to be fixed before the paper is accepted (details follow below).

5. Writing in general needs some work. While the English is not the problem here, the way ideas/concepts are introduced is at times not visible, since they appear inside text. Perhaps once all concepts are properly formalised this will cease to be an issue.

Specific comments section by section
[cited text from the manuscript is inside quotation marks ""]:
### Introduction:
Generally quite good, but an example illustrating the points made on page 1/column 1 of page 2, would greatly improve readability. Also, the term zero-knowledge is not that standard, so I had some problems understanding what is going on during a first read. Perhaps this could be expanded a bit.

"oft" -> often
"documents containing RDF collected" - rephrase
"is is"
"enjoys pride of place" - rather archaic

### Section 2:
I find this section somewhat redundant.

### Section 3:
"Select-project-join" - this is not really the fragment used in most of the paper; it is just join for row operations, and then select-join for column ones

The notation P_i \cup P_j is a bit non-standard as it defines a join; perhaps stress this as to not confuse the "database reader"

Start of 3.1. - perhaps a bit more formality here; define compatible mappings, etc.

Def 3.2 - in both cases this is c(\mu(u)); but if you want to stress the different treatment of blanks OK

Def 3.3 - the set \mathcal{G} is not linked with the G below.

Text after Def 3.3, while intuitively clear, is not linked to the definition, as the latter uses a single graph at a time. Formalising this and/or using an example might be of help here.

Def 3.4 -- the formulation is ambiguous. Does every blank get mapped to the same term or not? Also, saying "any blank node" could signify that the mapping might be partial on blanks. See comments about Section 6 and 7 below.

Thm 3.1 -- homomorphism between RDF sets not defined previously.

Thm 3.2 -- not defined what "standardized apart" means. Furthermore, it is still confusing at this point how federated answering works. Does the user define the graph/context where the pattern is evaluated or not? This should be clear from the start. Therefore: examples + the formalisation of the setting are needed previously.

Def 3.5 + 3.6 -- examples needed; also, the notation f(\mu_g,P) should be formally introduced: i.e. let f be a function assigning...

Thm 3.4 + section 3.3 -- notation is quite bad, as c_m,...,c_n is not a good enumeration for P_1,...,P_k; changing m for 1 and n for k is much easier to follow.

Def 3.7 -- Perhaps if the setting is defined by attaching to each subquery P_i the IDs of sources it is identified over, then all \mathcal{G_i} can be made equal G, with evaluation being empty when the source is not in the list. This depends on how the whole setting is formalised; but a good option is to replace P_i with (S_i,P_i), where S_i is the list of sources P_i will be evaluated over. The entire query is then join of all P_is, etc.

Sec 3.3. Overall, after reading this I was completely lost as of what the setting being studied actually is. This is why formalisation and examples are needed.

"As regards RDF specifics" - rephrase

### Section 4:
Main comments:
a) The notation should be fixed and formalised. More details below, but the notion of trees, how they decompose, etc. should be defined in a clear way.
b) Maybe stress a bit more that column conditions require equality, while the row ones require only equivalence. This is later used quite a bit.

Def 4.2 -- t[\mu] the definition should be the other way around

Def 4.3 -- need to specify what P,p,\Mu are; also, "We write ..." should say that the symbol denotes equivalence

Both definitions need examples.

Lemma 4.1. Column/row operators should be defined formally... i.e a function from to...; Also, why is a proof needed here. Equality will always imply equivalence.

Def 4.4. -- evaluation tree and join not defined formally before (i.e. is this the join of trees. results, roots?); also, it needs to be commented how a single triple pattern can be viewed as a join

root-pattern not defined

before Def 3.5 -- L/R should refer to left and right subtree, but is not mentioned

before Def 4.6 -- maybe add a footnote emphasising that OPTIONAL is not considered in this paper

Def 4.6. columns/rows are reversed; also , put an example

R-padding - o(o(\Mu) join o(\Mu))

Def 4.7 -- I am assuming this is intended to be computed bottom-up (i.e. starting at the leaves)

Thm 4.2 -- r(T) is not defined; I am assuming it is the root

C-padding -- maybe use a different letter for row/column operators

Def 4.10 -- it is somewhat strange that this is the first place projection is mentioned; maybe introduce this in Sec 3? Also, these would be select variables in the SQL notation.

Corollary 4.6 -- the statement of the corollary is incomplete. As it is, it is identical to Thm 4.5, so a clarification of what omegas are is needed

"algebraic expression" - a bit of an overloaded term
"will be treated denoted"
"what requirements that it is reasonable"

### Section 5:
This section is generally much better written than the previous ones, providing examples and clean proofs (one minor omission). Some formalisation is still lacking though.

Def 5.1 -- Omegas should be all the nodes, not just leaves

Ex 5.1. Why introduce \mathbb{B} for blanks here; better just continue using B?

p12, first paragraph "kons on blank nodes are bracketed" -- what is this supposed to mean?

!!! Lemma 5.1. -- induction step "essentially by repetition" -- this is not true, as one needs to expand the definition of JV through union. This should be added to the proof.

Sec 5.1 -- first paragraph should be illustrated by an example

Lm 5.2 -- the notion of filter, R, \sigma_R should be defined properly; the result is correct though

Also, filter is an overloaded work in the Semantic Web/relational DBs context, so it is not clear to which one the paper is referring.

before Lm 5.3 -- filters need to be introduced formally; in particular the comment about conjunctive normal form is pertinent, as without it the proof fails

Lm 5.3 -- second to last paragraph of the proof: second sentence is not needed.

before Lm 5.5 "with the provisos gone" -- not the best way to state things

Lm 5.5 -- one wants to prove equivalence really (R-distributivity). Please make more clear; maybe say in the proof that something stronger can be shown (C-distributivity)

Thm 5.6 -- notation lacking; Omega is a tree, Omega_1 left child, etc. In the proof \mu_1, \mu_2 in the second to last paragraph are \mu_i and \mu_j.

As a side note, the proof of this theorem is the first place where it is precisely clear what the setting is (i.e. each P_i is evaluated on different source). Hence my insistence on formalising the setting at the beginning.

### Section 6:
Main comment: the entire section assumes that all blanks get mapped to the same constant. This should have been formalised when defining mappings and homomorphisms. Also, please remark that this is not how blanks are usually treated in the literature (also, in the relational setting there is the notion of named nulls, which intuitively allow mapping different nulls to different constants).

Thm 6.1 -- maybe call a proposition

below Thm 6.1 "in Figure 4: the second and third row are equivalent" -- it was not defined what this means. In particular, if in Def 3.4 we allow blanks to be mapped to different constants, this is not true. These things need to be clear from the start.

below Def 6.1. -- the notion of P-shaped graph should have been introduced formally before

Lm 6.4. While easy to guess, the notation \Omega(P) should have been introduced formally in Def 4.2.

Ex 6.1 The mappings h_1 and h_2 should be given explicitly, together with their inverses. In line with the comments above, it is not clear how _b1 is to be treated. Assuming it is a blank (which would be _:b1 in the notation of the paper) Figure 6 provides some insight into this, but the example should be redone to reflect all of this. Also, \beta{\Omega} = \{\mu_1,\mu_2\} if I'm not mistaken.

Lm 6.5 -- I am not sure if "preserves" is the best word here. Also, it might be good to repeat here how a result set \Omega is connected to the tree T_\Omega.

!!! Lemma 6.6 -- While the result is trivially correct, the proof does not hold up. In particular, when \Omega is singleton, one can not assume that there are two mappings in \beta(\Omega). In fact, even disregarding this, the second sentence of the second paragraph does not follow (not in all cases anyway). On the other hand, the proof is not really needed here, as the claim is that if some equivalence class is missing one can not cover the entire set being partitioned by the relation, which is trivially true for any equivalence relation.

Ex 6.3 -- \mu_1 \cup \mu_2 is not compatible itself; the two mappings are.

### Section 7:
My main problem is with the proof of Lemma 7.3. Perhaps I am missing something here. Also, Example 7.3 really illustrates why all of the notions should be defined formally in a clear way.

Second paragraph -- "It enjoys local minimality property" -- this was not proved nor discussed anywhere

Third paragraph -- "compatible choice functions" should be defined formally in line with Ex 6.3; also, the row minimality comment is very vague

Thm 7.2 -- in the proof one should use g(\overline{\mu}) and not just g(\mu) in many places; also, \mathbb{B} is used for blanks, even though B is the notation introduced in Section 3.

!!! Lemma 7.3 -- I do not follow why \overline{\mu}=\{\mu\}. Was this proved somewhere before? As far as I see it, \overline{\mu}=\{\mu\} only when there is at most one blank node and is generally not true.

!!! Ex 7.3 -- \mu_1' \nsim_P_1 \mu_1'' is confusing, as in Ex 6.3 a different notion is used, namely that all blanks go to the same constant. It seems that here we are assuming that a blank is mapped to a constant only if the constant is also present inside the same mapping, which is something that should be formally defined from the beginning.

### Section 8:
This section and the following one are much more clear than the rest of the paper.

Thm 8.3 -- V\subseteq Var(\Omega^1); also, although the proof is correct, I think one can prove this easier by applying Thm 8.2

"falls out" ->follows

Thm 8.5 -- maybe explain previously what does a non-trivial extension mean.

### Section 9:
Again, very nice and clear section in general.

Def 9.1 - I am not convinced that this is the best notion. It can be the case that triple pattern contributes with a variable ?x, but contains one not inside X as well, thus excluding it from the set. E.g. if X={?x} and P = (?x ?y a) (and similarly when more triples are present). Also, why is this notion needed really? All the proofs work without it anyway.

"legit" is a bit informal

Thm 9.3 -- o(\Omega) in lines two and three of the first equation in the proof.

"a paper sized package" -- strange way of putting it.

Review #3
By Olaf Hartig submitted on 03/Jan/2017
Major Revision
Review Comment:

The paper presents a thorough theoretical study of logical operators in a federated SPARQL query execution plan. The aim of these operators is to reduce the size of intermediate results during the execution of a (conjunctive) SPARQL query over a federation of SPARQL endpoints. In particular, the authors study so-called row-pruning operators that remove entire solution mappings from intermediate results, as well as column-pruning operators that project out variables from all mappings of an intermediate result. The perhaps most important assumption of this study is the use of set semantics (that is, for the standard SPARQL bag semantics, the studied operators would be incorrect). Then, the studied row-pruning operators exploit a potential existence of blank nodes in intermediate results, whereas the studied column-pruning operator exploits the fact that some variables, after being used as join variables in earlier stages of the query execution plan, are not used anymore in later stages of the plan (and are not needed for the final projection operator).

Research on the theoretical foundations of efficiently querying a federation of SPARQL endpoints is still rare. Hence, work such as the one presented in the paper is a welcome contribution. Despite its abstract nature, the paper is easy to understand if the reader is willing to put some effort into it. The presented work itself is formally correct, and it is presented in a sufficiently precise manner (with a few exception as mentioned below).

Unfortunately, while the formal framework and the results are correct, I do not see any practical relevance of these contributions. It appears that the whole theory presented in the paper is based on the notion of an evaluation tree with so-called "b-components" as leaf nodes. More precisely, the studied operators are defined based on such an evaluation tree and the proofs depend on this notion. The issue is that the decomposition of a BGP P into such b-components is defined in terms of a solution mapping \mu_g in the query result of P over the federation (more precisely, over the union of the RDF graphs of the federation members)--see Definition 3.6. Hence, identifying the "b-component[s] of P relative to \mu_g" (Def.3.6) requires computing \mu_g beforehand, which, in turn, means computing at least a partial query result of P over the federation beforehand. Then, the question is: of what use is an evaluation tree for computing P over the federation if the result of this computation must already be available (at least partially) for creating the evaluation tree?

In addition to this paradoxon, there is a perhaps even more problematic issue with the fact that the theory is based on an evaluation tree made of "b-components" that depend on a single solution \mu_g; namely, the decomposition of a BGP P into b-components may differ depending on which solution \mu_g one chooses, and each decomposition may only be used to produce a subset of the (complete) result of P over the federation. Here is an example: Assume the following BGP:

P = { (u1,p,?x), (u2,q,?x) } .

Furthermore, assume a federation consisting of the following two RDF graphs:

G1 = { (u1,p,u3), (u1,p,_:b1), (u2,q,_:b1) } ,
G2 = { (u2,q,u3) } .

Then, the complete result of P over the federation (i.e., over the union of G1 and G2) consists of two solution mappings:

\mu_a = { ?x -> u3 } ,
\mu_b = { ?x -> _:b1 } .

Now, if we choose \mu_a to decompose P, we obtain two b-components:

Pa1 = { (u1,p,?x) } ,
Pa2 = { (u2,q,?x) } .

However, if we choose \mu_b instead, we obtain one b-component:

Pb = P .

Since each b-component is to be evaluated separately by each federation member (see Definition 3.7), the second decomposition (whose evaluation tree has the evaluation of Pb as its only leaf node) cannot be used to produce solution \mu_a. Moreover, with the first decomposition (whose evaluation tree has two leaf nodes, generated for Pa1 and Pa2, respectively) we fail to produce solution \mu_b if we apply the operators proposed by the authors (in particular, the filter operator introduced in Sec.5). Consequently, both evaluation trees have to be used and their results have to be union'ed to be guaranteed to obtain the complete result consisting of both \mu_a and \mu_b.

Hence, unless I miss something that is not obvious from the paper, it seems that, in general, all possible decompositions of a given BGP have to be considered (and the results of their respective evaluation trees have to be union'ed) to achieve completeness. Considering all possible decompositions, in turn, means to consider all solutions of the complete query result because each such solution may result in a new decomposition not seen before. Then, again, we have the paradoxon: why would we want to generate decompositions and evaluation trees to be used for computing a query result if this query result must already be available as an input for the process?

Therefore, even if the authors' theory is formally correct, I do not see how and why it would be applied in practice. If, on the other hand, it is not supposed to be practically applicable, then the authors have to make this limitation clear. In its current form the paper does not seem to even mention the paradoxon.

In addition to this fundamental issue, the presented work has a few more major technical issues and a number of minor issues. In the remainder of this review I list these issues.

additional technical issues

* Definition 5.2 is formally incorrect because it uses LJ(P), which is undefined. That is, function LJ(.) has not been defined for BGPs. Instead, Definition 5.1 defines it for the nodes \Omega^k of a concrete "evaluation tree" (that assumes a concrete decomposition of a BGP P relative to a particular solution \mu_g). Consequently, Definition 5.2 can (and should) also be framed only in terms of a concrete evaluation tree. In fact, in the subsequent parts of the paper, the filter expression as introduced in Definition 5.2 is always used in the context of concrete evaluation trees (e.g., Lemma 5.5).

* Related to the previous point, the symbol R^P_{\neq b} with which the authors denote the filter expression introduced in Definition 5.2--in particular, the use of P within this symbol--seems to suggest a generality that does not exist and, thus, is misleadingly incorrect.

* Neither the \tau operator (as used in the proof of Theorem 9.2 and in Theorem 9.3) nor the notion of "a \tau reduced tree" (Theorem 8.5) are defined formally. As a consequence, these theorems and their proofs remain informal results.

* Similarly, the notion of a "pruning operation" (which is central in Theorem 9.2) has not been defined. Moreover, the proof of this theorem is concerned only with (combinations of) the three operators as studied in the papers (namely, \beta, \sigma, and \tau). To really show the theorem either the proof must show that these three operators (and combinations thereof) are the only "pruning operation[s]" that are possible or the proof must be made more general.

minor issues and suggestions

* The text is unnecessarily wordy. Just as one out of many examples, almost all occurrences of "of course" should be removed (e.g., "The intent, of course, is that b-components be query patterns that ..."). A more concise presentation would be appreciated.

* For each of the steps of the proofs, do not only mention the property that you are using but also refer to the definition of the property. For instance, in the proof of Theorem 4.4, instead of just writing "by df. of T^o-", write "by df. of T^o- (Def.4.7)." Such a cross-referencing is particularly helpful in later parts of the paper.

* Similarly, I strongly suggest to include a table that lists all the symbols used, the name of the concept that each of them denotes, and references to the corresponding definitions.

* I assume that Theorem 3.1 was supposed to be a definition rather than a theorem. If it is indeed meant to be a theorem, then the authors must first define the notion of "entails" and provide a proof.

* Irrespective of whether Theorem 3.1 is supposed to be a definition or a theorem, it is not entirely precise because the notion of an RDF homomorphism as used in Theorem 3.1 is defined in terms of "some context c" (see Definition 3.4). The dependency on a context cannot simply be ignored in Theorem 3.1.

* The paragraph after Theorem 3.1 is meaningless because entailment is not defined for "answer sets" (in contrast, it is defined for RDF graphs--if we assume that Theorem 3.1 is meant to be a definition).

* Various theorems assume RDF graphs that "are standardized apart" (e.g., Theorem 3.2). It is not clear what this is supposed to mean.

* The image \mu(P) as introduced in Definition 4.2 is already used before (e.g., Theorem 3.2, Definition 3.6, Theorem 3.3) and, thus, should be defined earlier (for instance, in Section 3.1).

* The notion of "point-wise P-equivalent" as used in the proof of Lemma 4.1 has not been introduced.

* In Definition 4.6, I think the formulas in bullet points 1 and 2 should be swapped.

* The paragraph after Definition 4.9 uses the term "\tau([[P]])", which has not been introduced.

* Similarly, I do not believe the function r as used in Theorem 4.4 has been introduced.

* Definition 5.1 is very confusing at first! It should be made clear that this is a recursive definition.

* In Lemma 5.2, I think it should be "and" instead of "or"

* In contrast to the claim in the sentence before Lemma 5.5, the lemma shows C-distributivity (not R-distributivity).

* In the second case in the proof of Theorem 7.6 it should be \Omega^L and \Omega^R (instead of \Omega^1 and \Omega^2).

* Definition 9.1: "...triple patterns ..." instead of "... triples ..."