Multiset semantics in SPARQL, Relational Algebra and Datalog

Tracking #: 3678-4892

Authors: 
Renzo Angles
Claudio Gutierrez
Daniel Hernandez

Responsible editor: 
Ruben Verborgh

Submission type: 
Full Paper
Abstract: 
The paper studies and determines the algebraic and logic structure of the multiset semantics of the core patterns of SPARQL formed by AND, UNION, OPTIONAL, FILTER, MINUS and SELECT. We show that it conforms a robust fragment with the same expressive power of well-known logical and relational query languages, named Datalog and Relational Algebra. Indeed, we identify and develop a logical formalism for multisets, namely non-recursive Datalog with safe negation, with the extension to multiset developed by Mumick et al., and a multiset relational algebra (projection, selection, natural join, arithmetic union and except), based on the framework defined by Dayal et al., and prove that these three formalisms have the same expressive power viewed as query languages.
Full PDF Version: 
Tags: 
Reviewed

Decision/Status: 
Major Revision

Solicited Reviews:
Click to Expand/Collapse
Review #1
By Olaf Hartig submitted on 26/Sep/2024
Suggestion:
Minor Revision
Review Comment:

The manuscript studies the expressive power of a fragment of SPARQL under multiset semantics by identifying both a version of Datalog with multiset semantics and a fragment of the relational algebra with multiset semantics that have the exact same expressive power as the fragment of SPARQL considered by the authors.

The material is well organized and presented in an easy-to-follow way. The results are correct (with a minor exception, but that is not difficult to fix---see below) and, to the best of my knowledge, this is the first publication showing these results. For these reasons, I am generally in favor of accepting this work for publication in the journal. However, there are three main points that the authors need to work on a bit more, plus several other more minor points.

Main points to work on:

M1. The mapping from the multiset relational algebra (MRA) to SPARQL in Section 8.1 is not complete. In particular, the function g31 that translates every MRA database into a "SPARQL database" (i.e., an RDF graph) does not consider the cardinalities of the tuples in the multiset relations. More specifically, the RDF triples produced by the function \beta that the definition of function g31 is based do not capture the cardinalities. Due to this issue, the claim that "MRA can be simulated in SPARQL" (Lemma 13) is not fully supported. Fixing this issue may be a bit of work but I don't foresee any inherent complications in it: The idea would be to extend the \beta function such that the resulting RDF triples also describe the cardinalities of the represented tuples, and then to extend the query-related mapping functions (f13 and h13) accordingly.

M2. The manuscript provides very little (almost nothing) in terms of motivation of the presented work. The authors should extend the introduction (ideally even the abstract) with a few statements to tell the reader why the presented work is of relevance and what the potential impact of the presented results may be. (I see that the authors base their work on a few questions, which are listed in the 'Objectives and Contributions' part of the introduction. Yes, it is not clear what motivates these questions. Why might it be interesting to have answers to these questions?)

M3. The 'Related work and Conclusions' section (Sec.9) needs to be improved in two ways. First, there are several sentences for which it is not clear what they are supposed to mean:
i) "the multiset semantics of SPARQL has not been systematically addressed" <- What exactly do the authors mean by "systematically address[ing]" the multiset semantics of SPARQL?
ii) "the goal of characterizing the multiset algebraic and/or logical structure of the operators in SPARQL." <- I don't understand what this goal is meant to be. (What is a "multiset algebraic structure of operators"? What is a "logical structure of operators"? "characterizing" in terms of what?)
iii) "this study shows the complexities and challenges that the introduction of multisets brings to [...] SPARQL" <- I don't see how the presented work shows such complexities and challenges.
Second, the related works part should be improved as well. At the moment, it appears to be a bit unorganized and it is focused only on SPARQL. Related formalisms, specifically the ones that the considered variants of Datalog and MRA are based on, should be discussed here as well. In particular, I would expect the commonalities and the differences (if any) between these considered variants and the variants in the literature to be elaborated very clearly.

Additional things that need to be clarified:

A1. The formal semantics for multisets in SPARQL should not be claimed as a contribution (as done in Sec.1, line 23 on page 3) because this has already been provided by Perez et al. in 2006 and is the basis of the SPARQL specifications already since version 1.0 of SPARQL.

A2. In a similar sense (and also related to my point M3 above), it is not clear to me to what extend the authors can actually claim that they "develop a logical formalism for multisets" (NRMD¬) and that they "develop the relational counterpart of this fragment." In fact, it is not crystal clear from the manuscript which parts of these formalisms are taken from the literature and exactly which parts have been added. The manuscript needs to be improved in this aspect!

A3. Why exactly does Table 1 contain the same SPARQL expression ("A AND B") in different rows? There should be a brief explanation of this.

A4. What does it mean that "the root of ti unifies with Ai under θ"? (line 39 on page 10)

A5. Line 45 on page 10 considers the case that "the root of t is a colored version of F" where t is a derivation tree. This is not possible because, by the given definition of derivation trees, the roots of these trees are not colored (see lines 34 and 41 on page 10), as is also visible in Fig.3. So, something seems to be wrong or missing here.

A6. The manuscript is inconsistent and sloppy in terms of how it calls the things that are translated into one another. For instance, in lines 37-40 on page 13, it should be "SPARQL pattern" instead of "SPARQL query", it should be "RDF graph" instead of "SPARQL database", it should be "NRMD¬ answer" instead of "NRMD¬ query solution", and it should be "multiset of mappings" ("multiset of solution mappings") instead of "SPARQL query solution". There are many more such inconsistencies throughout the whole manuscript.

A7. Similarly, the manuscript sometimes uses the word "multiplicity" (e.g., line 43 on page 20, Def.23 on page 27) and more often the word "cardinality". Pick one and be consistent about it!

A8. Def.18: SPARQL is not defined for "a multiset of RDF triples" -- there is no such thing in the context of RDF and SPARQL.

A9. The formula of Σ_V at the end of page 25 seems incorrect (incomplete) to me. For instance, just saying "s,p ∈ V" in the first subset means that the condition "S=P" will be added to the selection operator if the subject and the predicate of the given triple pattern are variables, no matter whether they are the same or two different variables! This doesn't seem to make sense. Instead, in addition to "s,p ∈ V" there should also be the condition that "s=p". Same issue for the other two subsets in the formula of Σ_V.

Minor things to fix:

m1. Line 35 on page 3 mentions a union operator using a symbol that does not show up in the actual definitions in Sec.5.

m2. The title of Sec.4 (page 9) is missing a closing parenthesis.

m3. Line 44 on page 10 talks about "nr-Datalog¬", which has not been introduced. Probably this was meant to be "NRMD¬".

m4. Line 8 on page 12 mentions "relation schema R = {A1,...,An}" which is not entirely correct. By the definitions given in the previous paragraph, R and {A1,...,An} are different things.

m5. Line 14 on page 12: "set of relational schemas" --> "set of relation schemas"

m6. Lines 15-16 on page 12: What does it mean that a "relation ri satisfies the schema Ri"?

m7. The definition in lines 35-36 on page 12 assumes that attribute A is in \widehat{E}_i, which should be mentioned explicitly as part of this definition.

m8. Line 1 on page 13: How is this notion of "is equal to" defined?

m9. The definition of Eval(E,D) on page 13 assumes that E and D are over the same relational database schema. This assumption should be made explicit in lines 6-7 and, also, by adding "over T" at the end of the first sentence on line 8.

m10. Line 49 on page 13 mentions "each term t ∈ G" which is incorrect. G is a set of triples, not a set of terms.

m11. The notation used in line 34 on page 17 has not been introduced. The notation for writing multisets as given in the paper is a different one (namely with a card function).

m12. It is not clear whether the variables used in Datalog substitutions are of the same kind as the variables in SPARQL solution mappings. According to the formulas in Def.8 that seems to be the case, but later in Def.11, the authors seem to make a difference.

m13. Def.9: the symbols p, c1, and cn (as used in the formula of the definition) are not introduced within the definition.

m14. Line 40 on page 18: What is "gp(L)^D_Π" ?

m15. Point 2 at the end of page 18 aims to define T(vr(Ri,L)) but the bullet points that follow introduce only T(Ri).

m16. Line 15 on page 19: "{(a0,a0,a0)}" --> "{(NULL,NULL,NULL)}"

m17. Similar issue with a0 in line 37 on page 19.

m18. Line 50 on page 19 mentions "θμ = {X1/c1,..., Xn/cn}". Why this notation? Where is it introduced? Why not the same notation as for the SPARQL solution mappings? Both are (partial) functions after all.

m19. Line 41 on page 20: "under" --> "over"

m20. Def.14: It is inconsistent to use the symbol R here. It should be r instead (to be consistent with Sec.5.1). Same issue in Sec.7.2.1 and in Def.15.

m21. Line 41 on page 23: "MRA relations" are undefined. Sec.5.1 calls them "multiset relations"!

m22. Same issue in lines 24-25 on page 25.

m23. Line 45 on page 23: "... to IRIs[, and tuples to IRIs]."

m24. Lines 28-29 on page 24: What are u1 and u2 in the given SPARQL pattern?

m25. Line 25 on page 25: "a set of tuples" --> "a multiset of tuples" !!

m26. Line 42 on page 25: "a multiset of RDF triples" --> "a set of RDF triples"

m27. Line 1 on page 26: "Selection(s,p,o)" --> "Selection(T)"

m28. Def.22: "from graph patterns" --> "from normalized graph patterns"

m29. Def.22: "whose selection formulas" --> "whose filter conditions"

m30. Table 7: What is P_\emptyset (in the first row of the table)? In fact, I don't think that this row is needed. The base case of the syntax are triple patterns, which are covered by the next row in the table.

m31. Line 43 on page 27: "RDF mapping" --> "SPARQL mapping"

Review #2
By Dörthe Arndt submitted on 08/Oct/2024
Suggestion:
Minor Revision
Review Comment:

The paper “Multiset semantics in SPARQL, Relational Algebra and Datalog” investigates the relationship between a fragment of SPARQL (the relational core), the Multiset Relational Algebra (MRA), and Non-recursive multiset Datalog with safe negation (NRMD^\neg). The authors prove that all of these frameworks have the same expressive power.

The paper is well-structured. First, the concept of query languages and their expressive power is formally defined. Then the three different frameworks are introduced in detail. This is followed by sections which always focus on two frameworks, define simulations from the first framework to the second and back and thereby show that these have the same expressive power. These sections use the notation from the introducing section such that it is easy to follow which exact mapping is defined in which moment and why this mapping is needed. The paper then discusses related work and concludes by listing the findings.

From my really subjective view (more as an interested reader), I really enjoyed the section about the problems of translating SPARQL Filters to Datalog which was well-explained and illustrated by examples.

While I appreciate the clear structure of the paper and also its contribution, I see two main points for improvements:

1. I could not always fully understand the details of the definitions or formalisations. This was for example the case for the section about MRA where it is important to understand how r, R, ^R, t, T and ^r relate or in Table 4 where the Datalog rules for SPARQL queries are introduced but not further explained. Here I would recommend to spend more times on details and also add examples where possible

2. I know that this is for space reasons, but in my opinion it is a little bit unfortunate that the main contribution of the paper, the proofs of the frameworks having the same expressive power can only be found in the appendix. I would recommend to either put longer proof sketches into the paper itself or to add examples illustrating the mappings such that the reader can better appreciate the contribution.

More detailed comments and questions:
- Definition 4: term(t) for each term t\in G -> G is a set of triples, so how is that meant?
- mapping on page 18: the definition of T(vr(R_i,L) is defined using T(R_i), but the latter is never defined, also gp(L)^D_\pi is not introduced
- introduction of multiset relations (page 12): it could be that the reason is that I am less familiar with MRA than with SPARQL and Datalog, but that part was difficult for me to read and would benefit from examples or at least from more detailed explanations.
- page 17, table 4, (P_1 AND P_2): the need for the predicate comp becomes only clear after reading section 8.2.1 where we have a similar construct. This is far too late, the rule needs explanation.
- If the authors show that SPARQL and NRMD^\neg have the same expressive power and that NRMD^\neg and MRA have the same expressive power, then it directly follows that also SPARQL and NRMD^\neg have the same expressive power. Why did the authors choose to still give a proof for the latter?
- page 11: In the proof for lemma 2, rules are rewritten and there is a claim that it is clear that the resulting program is normal because the original program was safe. I don’t see why var(L_2)=var(L_1), I can only see the inclusion. Maybe that could be clarified?
- page 12: “A relational database schema is a set of relational schemas. Given a relational database schema T = {R1, . . . , Rn}, a multiset relational database over T is a set of multiset relations {r1, . . . ,rn} where each relation ri satisfies the schema Ri .” -> what does it mean to satisfy a schema?
- Lemmas 8, 9, etc. I personally would prefer to see the direct claims instead of the lemmas, mainly because the functions f,g,h are already defined in the text, so it would make sense to directly state that these are simulations. I am furthermore aware that the proofs don’t fit in the text, but I would prefer to get some more detail about the respective proofs here. An alternative could be to provide a small example how the simulation works.
- page 18: NRMD^\neg to SPARQL: I think an interesting detail of the mapping is that the predicate name is handled like the arguments and also is represented by a special triple (u, \alpha_0,p). I would mention that somewhere.
- page 23, Definition 18: SPARQL is defined on RDF graphs which are sets, the definition maps to multisets of RDF triples. Here I see a mismatch which needs to be resolved.
- Related work and conclusions: I think that the contribution is interesting, I would however expect the authors to write a little bit more about the relevance of the findings. Maybe some practical consequences for implementers or the fact that proven results for one framework then can be transferred to others?

Minor comments and typos:
- page 7+8: var(t), var(P) -> var is used but not defined
- page 8, Definition 3: “Every sub-pattern… holds that …” -> “For every sub-pattern… it holds that …”
- page 8: “..., the following equivalences are hold”-> “..., the following equivalences hold”
- page 14: the “solution to the query on the right side” is used twice, I guess the first one should be the left side?
- page 16: rules presented in Table 5 -> Table 4
- page 18: especial -> special
- page 18, defintion 9: {NULL, NULL, NULL} ->{(NULL, NULL, NULL)}, as g_{2,1} maps into a set of triples
- page 18: “… is equivalent to R but have literal L as head.” -> has
- page 19, Example 4: gp((q(X), \pi)) ->gp(q(X), \pi)
- page 19: “This is done using the additional triple {NULL, NULL, NULL}...”, most likely (NULL, NULL, NULL) is meant?
- page 19, Example 5: f_{2,1}((q(X), Π)-> f_{2,1}((q(X), Π))
- page 20, Theorem 1, page 23, Theorem 2, page 28, Theorem 3: “If follows from Lemma X and Lemma Y” -> The Claim follows from…
- page 20: “ For each tuple t in r, Σ(r) contains a fact f of the form p(c1, . . . , cn) where p is R,...” where p is the image of R (or however we want to use the aforementioned mapping)