Review Comment:
The manuscript studies the expressive power of a fragment of SPARQL under multiset semantics by identifying both a version of Datalog with multiset semantics and a fragment of the relational algebra with multiset semantics that have the exact same expressive power as the fragment of SPARQL considered by the authors.
The material is well organized and presented in an easy-to-follow way. The results are correct (with a minor exception, but that is not difficult to fix---see below) and, to the best of my knowledge, this is the first publication showing these results. For these reasons, I am generally in favor of accepting this work for publication in the journal. However, there are three main points that the authors need to work on a bit more, plus several other more minor points.
Main points to work on:
M1. The mapping from the multiset relational algebra (MRA) to SPARQL in Section 8.1 is not complete. In particular, the function g31 that translates every MRA database into a "SPARQL database" (i.e., an RDF graph) does not consider the cardinalities of the tuples in the multiset relations. More specifically, the RDF triples produced by the function \beta that the definition of function g31 is based do not capture the cardinalities. Due to this issue, the claim that "MRA can be simulated in SPARQL" (Lemma 13) is not fully supported. Fixing this issue may be a bit of work but I don't foresee any inherent complications in it: The idea would be to extend the \beta function such that the resulting RDF triples also describe the cardinalities of the represented tuples, and then to extend the query-related mapping functions (f13 and h13) accordingly.
M2. The manuscript provides very little (almost nothing) in terms of motivation of the presented work. The authors should extend the introduction (ideally even the abstract) with a few statements to tell the reader why the presented work is of relevance and what the potential impact of the presented results may be. (I see that the authors base their work on a few questions, which are listed in the 'Objectives and Contributions' part of the introduction. Yes, it is not clear what motivates these questions. Why might it be interesting to have answers to these questions?)
M3. The 'Related work and Conclusions' section (Sec.9) needs to be improved in two ways. First, there are several sentences for which it is not clear what they are supposed to mean:
i) "the multiset semantics of SPARQL has not been systematically addressed" <- What exactly do the authors mean by "systematically address[ing]" the multiset semantics of SPARQL?
ii) "the goal of characterizing the multiset algebraic and/or logical structure of the operators in SPARQL." <- I don't understand what this goal is meant to be. (What is a "multiset algebraic structure of operators"? What is a "logical structure of operators"? "characterizing" in terms of what?)
iii) "this study shows the complexities and challenges that the introduction of multisets brings to [...] SPARQL" <- I don't see how the presented work shows such complexities and challenges.
Second, the related works part should be improved as well. At the moment, it appears to be a bit unorganized and it is focused only on SPARQL. Related formalisms, specifically the ones that the considered variants of Datalog and MRA are based on, should be discussed here as well. In particular, I would expect the commonalities and the differences (if any) between these considered variants and the variants in the literature to be elaborated very clearly.
Additional things that need to be clarified:
A1. The formal semantics for multisets in SPARQL should not be claimed as a contribution (as done in Sec.1, line 23 on page 3) because this has already been provided by Perez et al. in 2006 and is the basis of the SPARQL specifications already since version 1.0 of SPARQL.
A2. In a similar sense (and also related to my point M3 above), it is not clear to me to what extend the authors can actually claim that they "develop a logical formalism for multisets" (NRMD¬) and that they "develop the relational counterpart of this fragment." In fact, it is not crystal clear from the manuscript which parts of these formalisms are taken from the literature and exactly which parts have been added. The manuscript needs to be improved in this aspect!
A3. Why exactly does Table 1 contain the same SPARQL expression ("A AND B") in different rows? There should be a brief explanation of this.
A4. What does it mean that "the root of ti unifies with Ai under θ"? (line 39 on page 10)
A5. Line 45 on page 10 considers the case that "the root of t is a colored version of F" where t is a derivation tree. This is not possible because, by the given definition of derivation trees, the roots of these trees are not colored (see lines 34 and 41 on page 10), as is also visible in Fig.3. So, something seems to be wrong or missing here.
A6. The manuscript is inconsistent and sloppy in terms of how it calls the things that are translated into one another. For instance, in lines 37-40 on page 13, it should be "SPARQL pattern" instead of "SPARQL query", it should be "RDF graph" instead of "SPARQL database", it should be "NRMD¬ answer" instead of "NRMD¬ query solution", and it should be "multiset of mappings" ("multiset of solution mappings") instead of "SPARQL query solution". There are many more such inconsistencies throughout the whole manuscript.
A7. Similarly, the manuscript sometimes uses the word "multiplicity" (e.g., line 43 on page 20, Def.23 on page 27) and more often the word "cardinality". Pick one and be consistent about it!
A8. Def.18: SPARQL is not defined for "a multiset of RDF triples" -- there is no such thing in the context of RDF and SPARQL.
A9. The formula of Σ_V at the end of page 25 seems incorrect (incomplete) to me. For instance, just saying "s,p ∈ V" in the first subset means that the condition "S=P" will be added to the selection operator if the subject and the predicate of the given triple pattern are variables, no matter whether they are the same or two different variables! This doesn't seem to make sense. Instead, in addition to "s,p ∈ V" there should also be the condition that "s=p". Same issue for the other two subsets in the formula of Σ_V.
Minor things to fix:
m1. Line 35 on page 3 mentions a union operator using a symbol that does not show up in the actual definitions in Sec.5.
m2. The title of Sec.4 (page 9) is missing a closing parenthesis.
m3. Line 44 on page 10 talks about "nr-Datalog¬", which has not been introduced. Probably this was meant to be "NRMD¬".
m4. Line 8 on page 12 mentions "relation schema R = {A1,...,An}" which is not entirely correct. By the definitions given in the previous paragraph, R and {A1,...,An} are different things.
m5. Line 14 on page 12: "set of relational schemas" --> "set of relation schemas"
m6. Lines 15-16 on page 12: What does it mean that a "relation ri satisfies the schema Ri"?
m7. The definition in lines 35-36 on page 12 assumes that attribute A is in \widehat{E}_i, which should be mentioned explicitly as part of this definition.
m8. Line 1 on page 13: How is this notion of "is equal to" defined?
m9. The definition of Eval(E,D) on page 13 assumes that E and D are over the same relational database schema. This assumption should be made explicit in lines 6-7 and, also, by adding "over T" at the end of the first sentence on line 8.
m10. Line 49 on page 13 mentions "each term t ∈ G" which is incorrect. G is a set of triples, not a set of terms.
m11. The notation used in line 34 on page 17 has not been introduced. The notation for writing multisets as given in the paper is a different one (namely with a card function).
m12. It is not clear whether the variables used in Datalog substitutions are of the same kind as the variables in SPARQL solution mappings. According to the formulas in Def.8 that seems to be the case, but later in Def.11, the authors seem to make a difference.
m13. Def.9: the symbols p, c1, and cn (as used in the formula of the definition) are not introduced within the definition.
m14. Line 40 on page 18: What is "gp(L)^D_Π" ?
m15. Point 2 at the end of page 18 aims to define T(vr(Ri,L)) but the bullet points that follow introduce only T(Ri).
m16. Line 15 on page 19: "{(a0,a0,a0)}" --> "{(NULL,NULL,NULL)}"
m17. Similar issue with a0 in line 37 on page 19.
m18. Line 50 on page 19 mentions "θμ = {X1/c1,..., Xn/cn}". Why this notation? Where is it introduced? Why not the same notation as for the SPARQL solution mappings? Both are (partial) functions after all.
m19. Line 41 on page 20: "under" --> "over"
m20. Def.14: It is inconsistent to use the symbol R here. It should be r instead (to be consistent with Sec.5.1). Same issue in Sec.7.2.1 and in Def.15.
m21. Line 41 on page 23: "MRA relations" are undefined. Sec.5.1 calls them "multiset relations"!
m22. Same issue in lines 24-25 on page 25.
m23. Line 45 on page 23: "... to IRIs[, and tuples to IRIs]."
m24. Lines 28-29 on page 24: What are u1 and u2 in the given SPARQL pattern?
m25. Line 25 on page 25: "a set of tuples" --> "a multiset of tuples" !!
m26. Line 42 on page 25: "a multiset of RDF triples" --> "a set of RDF triples"
m27. Line 1 on page 26: "Selection(s,p,o)" --> "Selection(T)"
m28. Def.22: "from graph patterns" --> "from normalized graph patterns"
m29. Def.22: "whose selection formulas" --> "whose filter conditions"
m30. Table 7: What is P_\emptyset (in the first row of the table)? In fact, I don't think that this row is needed. The base case of the syntax are triple patterns, which are covered by the next row in the table.
m31. Line 43 on page 27: "RDF mapping" --> "SPARQL mapping"
|