Review Comment:
In this paper the authors study the evaluation of federated conjunctive queries (CQs)
over RDF datasets containing blank nodes. The main objective of the study is to
determine the minimal amount of information that needs to be transmitted in order
to correctly answer the query in this setting. To this end, the authors propose
several methods for removing results of different triple patterns forming the query,
or by reducing them in case some variables they bring will not form part of the output.
Overall, the setting of the paper is quite interesting, and the main problem that is
tackled is relevant for the Semantic Web community. The results, although mostly straightforward, do shed light on how to minimize the amount of information that needs to be transmitted, and the operators the authors propose for this task could prove useful in practice when one is dealing with blank nodes over distributed RDF sources. However, the presentation of the material has a lot to be desired, and in its present state the paper is simply not ready for publication.
The main complaint I have about the paper is regarding its lack of mathematical rigidity,
which is crucial here, considering that the main aim of the manuscript is to give a formal framework for dealing with blank nodes under federation and *prove* how various operations can make CQ evaluation efficient in this setting. Probably the biggest issue here is that many terms are left undefined, and the reader has to either look them up in the previous literature,
or figure them out from the proofs. At times it almost seems as if the authors assume the reader to be familiar with all the terminology in this setting, which I find rather unreasonable for a journal publication that is meant to be selfcontained. The best illustration of this is probably the fact that it is never formally defined what it means to answer a CQ in the federated setting (see more detailed comments about Sec 3 below).
Next, I have some reservation about certain proofs. And while the validity of the results whose
proofs are in question is easy to verify, I find it somewhat troubling that the authors do not seem to be aware of the result they are trying to prove (see e.g. comments about Lemma 6.6).
Finally, the lack of examples is quite evident in the first part of the paper when the terminology is introduced, thus requiring the reader to go over the text several times in order to get the idea of what is going on. I believe that adding examples would greatly improve the readability of the manuscript and significantly improve the presentation of the material.
In conclusion, in its present state, the paper is simply not rigorous enough to be accepted for publication. Missing definitions, incomplete proofs, and unclear notation being the main detractors here, with the rather poor readability coming at a close second. However, one has to note that the results of the manuscript are indeed quite interesting and might provide useful in practice. Therefore, should the aforementioned issues be fixed, I will be glad to give the paper another careful read. My recommendation is to revise and resubmit the manuscript. More specific comments follow.
Main issues that need to be fixed:
1. The context of CQ evaluation under federation is never formally defined. In particular, I found it difficult to understand at first if all the patterns are to be evaluated at all the sources, if the user predetermines which pattern is evaluated where, etc. For a theoretical paper these notions should be defined formally and illustrated by an example. I believe that this was in fact done before in the paper "Semantics and optimization of the SPARQL 1.1 federation extension" (Aranda, Arenas, Corcho), and the extension to the notation of this manuscript should be rather straightforward. I believe that this will be much more readable than the explanation in Section 3.3, which I had to read several times before understanding what the setting studied in the paper actually is. (In fact, the best explanation comes at the very end in Section 10.)
2. Many concepts are not defined well, or are assumed to be known from the previous literature. For instance, homomorphisms between RDF graphs, what it means to be standardized apart, the fact that all blanks are mapped to the same term (Examples 6.3 and 7.3), all of which makes it difficult to follow some proofs. The concepts which are used in the paper should be defined formally.
3. Lack of examples. As discussed above, this is quite problematic, as it makes it difficult to follow the text. Starting form the introduction, examples should be introduced to either illustrate the point authors are trying to make, or to help understanding formal definitions. The previous paper of the authors' in this same journal seems to do a much better job at this. Since the lack of space is not an issue here, this should be fixed, even if repetition is required.
4. Problems with some proofs. These need to be fixed before the paper is accepted (details follow below).
5. Writing in general needs some work. While the English is not the problem here, the way ideas/concepts are introduced is at times not visible, since they appear inside text. Perhaps once all concepts are properly formalised this will cease to be an issue.
Specific comments section by section
[cited text from the manuscript is inside quotation marks ""]:
### Introduction:
Generally quite good, but an example illustrating the points made on page 1/column 1 of page 2, would greatly improve readability. Also, the term zeroknowledge is not that standard, so I had some problems understanding what is going on during a first read. Perhaps this could be expanded a bit.
typos:
"oft" > often
"documents containing RDF collected"  rephrase
"is is"
"enjoys pride of place"  rather archaic
### Section 2:
I find this section somewhat redundant.
### Section 3:
"Selectprojectjoin"  this is not really the fragment used in most of the paper; it is just join for row operations, and then selectjoin for column ones
The notation P_i \cup P_j is a bit nonstandard as it defines a join; perhaps stress this as to not confuse the "database reader"
Start of 3.1.  perhaps a bit more formality here; define compatible mappings, etc.
Def 3.2  in both cases this is c(\mu(u)); but if you want to stress the different treatment of blanks OK
Def 3.3  the set \mathcal{G} is not linked with the G below.
Text after Def 3.3, while intuitively clear, is not linked to the definition, as the latter uses a single graph at a time. Formalising this and/or using an example might be of help here.
Def 3.4  the formulation is ambiguous. Does every blank get mapped to the same term or not? Also, saying "any blank node" could signify that the mapping might be partial on blanks. See comments about Section 6 and 7 below.
Thm 3.1  homomorphism between RDF sets not defined previously.
Thm 3.2  not defined what "standardized apart" means. Furthermore, it is still confusing at this point how federated answering works. Does the user define the graph/context where the pattern is evaluated or not? This should be clear from the start. Therefore: examples + the formalisation of the setting are needed previously.
Def 3.5 + 3.6  examples needed; also, the notation f(\mu_g,P) should be formally introduced: i.e. let f be a function assigning...
Thm 3.4 + section 3.3  notation is quite bad, as c_m,...,c_n is not a good enumeration for P_1,...,P_k; changing m for 1 and n for k is much easier to follow.
Def 3.7  Perhaps if the setting is defined by attaching to each subquery P_i the IDs of sources it is identified over, then all \mathcal{G_i} can be made equal G, with evaluation being empty when the source is not in the list. This depends on how the whole setting is formalised; but a good option is to replace P_i with (S_i,P_i), where S_i is the list of sources P_i will be evaluated over. The entire query is then join of all P_is, etc.
Sec 3.3. Overall, after reading this I was completely lost as of what the setting being studied actually is. This is why formalisation and examples are needed.
typos:
"As regards RDF specifics"  rephrase
### Section 4:
Main comments:
a) The notation should be fixed and formalised. More details below, but the notion of trees, how they decompose, etc. should be defined in a clear way.
b) Maybe stress a bit more that column conditions require equality, while the row ones require only equivalence. This is later used quite a bit.
Def 4.2  t[\mu] the definition should be the other way around
Def 4.3  need to specify what P,p,\Mu are; also, "We write ..." should say that the symbol denotes equivalence
Both definitions need examples.
Lemma 4.1. Column/row operators should be defined formally... i.e a function from to...; Also, why is a proof needed here. Equality will always imply equivalence.
Def 4.4.  evaluation tree and join not defined formally before (i.e. is this the join of trees. results, roots?); also, it needs to be commented how a single triple pattern can be viewed as a join
rootpattern not defined
before Def 3.5  L/R should refer to left and right subtree, but is not mentioned
before Def 4.6  maybe add a footnote emphasising that OPTIONAL is not considered in this paper
Def 4.6. columns/rows are reversed; also , put an example
Rpadding  o(o(\Mu) join o(\Mu))
Def 4.7  I am assuming this is intended to be computed bottomup (i.e. starting at the leaves)
Thm 4.2  r(T) is not defined; I am assuming it is the root
Cpadding  maybe use a different letter for row/column operators
Def 4.10  it is somewhat strange that this is the first place projection is mentioned; maybe introduce this in Sec 3? Also, these would be select variables in the SQL notation.
Corollary 4.6  the statement of the corollary is incomplete. As it is, it is identical to Thm 4.5, so a clarification of what omegas are is needed
typos:
"algebraic expression"  a bit of an overloaded term
"will be treated denoted"
"what requirements that it is reasonable"
### Section 5:
This section is generally much better written than the previous ones, providing examples and clean proofs (one minor omission). Some formalisation is still lacking though.
Def 5.1  Omegas should be all the nodes, not just leaves
Ex 5.1. Why introduce \mathbb{B} for blanks here; better just continue using B?
p12, first paragraph "kons on blank nodes are bracketed"  what is this supposed to mean?
!!! Lemma 5.1.  induction step "essentially by repetition"  this is not true, as one needs to expand the definition of JV through union. This should be added to the proof.
Sec 5.1  first paragraph should be illustrated by an example
Lm 5.2  the notion of filter, R, \sigma_R should be defined properly; the result is correct though
Also, filter is an overloaded work in the Semantic Web/relational DBs context, so it is not clear to which one the paper is referring.
before Lm 5.3  filters need to be introduced formally; in particular the comment about conjunctive normal form is pertinent, as without it the proof fails
Lm 5.3  second to last paragraph of the proof: second sentence is not needed.
before Lm 5.5 "with the provisos gone"  not the best way to state things
Lm 5.5  one wants to prove equivalence really (Rdistributivity). Please make more clear; maybe say in the proof that something stronger can be shown (Cdistributivity)
Thm 5.6  notation lacking; Omega is a tree, Omega_1 left child, etc. In the proof \mu_1, \mu_2 in the second to last paragraph are \mu_i and \mu_j.
As a side note, the proof of this theorem is the first place where it is precisely clear what the setting is (i.e. each P_i is evaluated on different source). Hence my insistence on formalising the setting at the beginning.
### Section 6:
Main comment: the entire section assumes that all blanks get mapped to the same constant. This should have been formalised when defining mappings and homomorphisms. Also, please remark that this is not how blanks are usually treated in the literature (also, in the relational setting there is the notion of named nulls, which intuitively allow mapping different nulls to different constants).
Thm 6.1  maybe call a proposition
below Thm 6.1 "in Figure 4: the second and third row are equivalent"  it was not defined what this means. In particular, if in Def 3.4 we allow blanks to be mapped to different constants, this is not true. These things need to be clear from the start.
below Def 6.1.  the notion of Pshaped graph should have been introduced formally before
Lm 6.4. While easy to guess, the notation \Omega(P) should have been introduced formally in Def 4.2.
Ex 6.1 The mappings h_1 and h_2 should be given explicitly, together with their inverses. In line with the comments above, it is not clear how _b1 is to be treated. Assuming it is a blank (which would be _:b1 in the notation of the paper) Figure 6 provides some insight into this, but the example should be redone to reflect all of this. Also, \beta{\Omega} = \{\mu_1,\mu_2\} if I'm not mistaken.
Lm 6.5  I am not sure if "preserves" is the best word here. Also, it might be good to repeat here how a result set \Omega is connected to the tree T_\Omega.
!!! Lemma 6.6  While the result is trivially correct, the proof does not hold up. In particular, when \Omega is singleton, one can not assume that there are two mappings in \beta(\Omega). In fact, even disregarding this, the second sentence of the second paragraph does not follow (not in all cases anyway). On the other hand, the proof is not really needed here, as the claim is that if some equivalence class is missing one can not cover the entire set being partitioned by the relation, which is trivially true for any equivalence relation.
Ex 6.3  \mu_1 \cup \mu_2 is not compatible itself; the two mappings are.
### Section 7:
My main problem is with the proof of Lemma 7.3. Perhaps I am missing something here. Also, Example 7.3 really illustrates why all of the notions should be defined formally in a clear way.
Second paragraph  "It enjoys local minimality property"  this was not proved nor discussed anywhere
Third paragraph  "compatible choice functions" should be defined formally in line with Ex 6.3; also, the row minimality comment is very vague
Thm 7.2  in the proof one should use g(\overline{\mu}) and not just g(\mu) in many places; also, \mathbb{B} is used for blanks, even though B is the notation introduced in Section 3.
!!! Lemma 7.3  I do not follow why \overline{\mu}=\{\mu\}. Was this proved somewhere before? As far as I see it, \overline{\mu}=\{\mu\} only when there is at most one blank node and is generally not true.
!!! Ex 7.3  \mu_1' \nsim_P_1 \mu_1'' is confusing, as in Ex 6.3 a different notion is used, namely that all blanks go to the same constant. It seems that here we are assuming that a blank is mapped to a constant only if the constant is also present inside the same mapping, which is something that should be formally defined from the beginning.
### Section 8:
This section and the following one are much more clear than the rest of the paper.
Thm 8.3  V\subseteq Var(\Omega^1); also, although the proof is correct, I think one can prove this easier by applying Thm 8.2
"falls out" >follows
Thm 8.5  maybe explain previously what does a nontrivial extension mean.
### Section 9:
Again, very nice and clear section in general.
Def 9.1  I am not convinced that this is the best notion. It can be the case that triple pattern contributes with a variable ?x, but contains one not inside X as well, thus excluding it from the set. E.g. if X={?x} and P = (?x ?y a) (and similarly when more triples are present). Also, why is this notion needed really? All the proofs work without it anyway.
"legit" is a bit informal
Thm 9.3  o(\Omega) in lines two and three of the first equation in the proof.
"a paper sized package"  strange way of putting it.

Comments