PRSC: from PG to RDF and back, using schemas

Tracking #: 3426-4640

Julian Bruyat
Pierre-Antoine Champin
Lionel Médini
Frédérique Laforest

Responsible editor: 
Stefan Schlobach

Submission type: 
Full Paper
Property graphs (PG) and RDF graphs are two popular database graph models, but they are not interoperable: data modeled in PG cannot be directly integrated with other data modeled in RDF. This lack of interoperability also impedes the use of the tools of one model when data are modeled in the other. In this paper, we propose PRSC, a configurable conversion to transform a PG into an RDF graph. This conversion relies on PG schemas and user-defined mappings called PRSC contexts. We also formally prove that a subset of PRSC contexts, called well-behaved contexts, can be used to reverse back to the original PG, and provide the related algorithm. Algorithms for conversion and reversion are available as open-source implementations.
Major Revision

Solicited Reviews:
Review #1
By Olaf Hartig submitted on 24/Aug/2023
Major Revision
Review Comment:

This manuscript introduces a well-defined approach to map Property Graphs to RDF-star graphs, including an algorithm that implements this approach. The mapping is user configurable in the sense that users have to provide a so-called "context" description that specifies for each type of node and edge in their Property Graph what triples are to be created for each node/edge of that type. The corresponding templates for such triples can refer to the values of the properties that the nodes/edges have in the Property Graph (such that these values can be used as literals in the triples created from these templates). By this approach, it is possible to use arbitrary RDF vocabularies for the resulting triples.
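To make the kind of mapping under review concrete, here is a toy sketch of a template-based conversion. It is an invented illustration, not the paper's actual Algorithm 1; all names (`convert`, the `?self` placeholder, the vocabulary terms) are assumptions of this sketch.

```python
# Toy sketch of a template-based PG-to-RDF conversion in the spirit of
# the reviewed approach. NOT the paper's Algorithm 1; names are invented.

def instantiate(term, node_id, props):
    """Replace placeholders: ?self -> the PG element itself,
    ?<key> -> the element's property value for <key>."""
    if term == "?self":
        return node_id
    if isinstance(term, str) and term.startswith("?"):
        return props[term[1:]]
    return term  # constant IRI: keep as-is

def convert(nodes, context):
    """For each PG node, emit the template triples registered for its
    type in the context, with placeholders filled in from the node."""
    triples = []
    for node_id, node_type, props in nodes:
        for s, p, o in context[node_type]:
            triples.append(tuple(instantiate(t, node_id, props)
                                 for t in (s, p, o)))
    return triples

# A "context": per node type, template triples whose placeholders refer
# to the node (?self) and to its property values (?name).
ctx = {"Person": [("?self", "rdf:type", "foaf:Person"),
                  ("?self", "foaf:name", "?name")]}

pg_nodes = [("_:n1", "Person", {"name": "Tim"})]
print(convert(pg_nodes, ctx))
# → [('_:n1', 'rdf:type', 'foaf:Person'), ('_:n1', 'foaf:name', 'Tim')]
```

Because the templates may use arbitrary IRIs, any RDF vocabulary can appear in the output, which is the flexibility the review highlights.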

The main technical contribution of the manuscript is to show formally that the mapping is reversible if the aforementioned user-provided context satisfies a number of conditions, where reversibility means that any input Property Graph can be reconstructed from the resulting RDF graph in combination with the context that was used for the conversion. As part of showing this reversibility property, the authors also provide an algorithm for the reverse conversion from resulting RDF graphs back to Property Graphs.
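The reverse direction can be pictured the same way: match each RDF triple against the context's templates to recover which PG element produced it and which property values it carried. Again a toy sketch with invented names, not the paper's actual reversion algorithm.

```python
# Toy sketch of reversion: match RDF triples against a context's
# templates to recover PG elements and their properties. Invented for
# illustration; a real reversion must also check constant positions and
# rely on templates being unambiguous (the paper's "well-behaved"
# conditions) so that each triple matches exactly one template.

def revert(triples, context):
    nodes = {}  # node_id -> (type, recovered properties)
    for s, p, o in triples:
        for node_type, templates in context.items():
            for ts, tp, to in templates:
                if tp != p or ts != "?self":
                    continue  # this sketch only matches on the predicate
                _, props = nodes.setdefault(s, (node_type, {}))
                if to.startswith("?") and to != "?self":
                    props[to[1:]] = o  # a property value is recovered

    return nodes

ctx = {"Person": [("?self", "rdf:type", "foaf:Person"),
                  ("?self", "foaf:name", "?name")]}
rdf = [("_:n1", "rdf:type", "foaf:Person"),
       ("_:n1", "foaf:name", "Tim")]
print(revert(rdf, ctx))
# → {'_:n1': ('Person', {'name': 'Tim'})}
```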

While there already are a few other publications about mappings from Property Graphs to RDF, the idea with the template triples as proposed in this manuscript is interesting. Moreover, to the best of my knowledge, there is no other work in the literature that formally (!) shows the reversibility property (sometimes also referred to as information preservation or losslessness) for an approach that provides the same level of flexibility in terms of the way the mapping can be customized by users as is done in this manuscript. I should also mention that I have checked the proofs in detail and can confirm that they are correct. In conclusion, I would be happy to see this work published in the journal ... but not in its current form. Instead, there are some things that need to be improved in the manuscript and some other things that I would like to see being added by the authors. The remainder of this review elaborates on these things. Additionally, a scan of a print-out of the manuscript with several minor comments and points that should be fixed can be found at:


The motivation for the presented work needs to be made more clear. In particular:

1.a) The introduction needs to state more explicitly what exactly the research question/problem is that the presented work aims to address.

1.b) Related to the previous point, the introduction also needs to make clear what is challenging about this research question/problem.

1.c) Moreover, I would expect some statement about why "the foreseen scenario is the conversion from PG to RDF" and not the other way around (and, by the way, by whom is this scenario "foreseen"?)


2.a) I would like to see a complexity analysis of the proposed PG-to-RDF conversion algorithm (Algorithm 1) and perhaps also of the reversion algorithm (if that one is also meant to be used in practice).

2.b) Given that the whole idea of the approach is to enable users to convert PGs into RDF in a "reversible" way, and that this reversibility requires that the given context is a well-behaved one, I think it is crucial that checking whether a given context is well behaved needs to be a computationally tractable task. Consequently, a missing piece of the presented work is a complexity analysis of the corresponding decision problem (for any given context ctx, decide whether ctx is well behaved), ideally in combination with an actual algorithm that can be used to decide.
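One plausible ingredient of such a decision procedure, hinted at by the paper's notion of signatures, is checking that every type owns at least one template triple whose "shape" no other type's templates share. The following is only a sketch under that assumption (the actual well-behavedness conditions are the paper's), but it illustrates that this part of the check reduces to a finite pairwise comparison of templates.

```python
# Naive sketch of a signature check over a context: does every type own
# at least one template triple whose shape no other type can produce?
# Invented names; the real well-behavedness conditions are the paper's.

def shape(template):
    """Abstract a template triple to its distinguishable shape:
    constants stay, property placeholders collapse to one marker."""
    return tuple("?" if isinstance(t, str) and t.startswith("?")
                 and t != "?self" else t
                 for t in template)

def has_signature(ctx):
    """For each type, report whether some template's shape is unique
    to that type across the whole context."""
    result = {}
    for typ, templates in ctx.items():
        others = {shape(t) for other, ts in ctx.items()
                  if other != typ for t in ts}
        result[typ] = any(shape(t) not in others for t in templates)
    return result

ctx = {
    "Person": [("?self", "rdf:type", "foaf:Person"),
               ("?self", "foaf:name", "?name")],
    "City":   [("?self", "rdf:type", "ex:City"),
               ("?self", "foaf:name", "?name")],
}
print(has_signature(ctx))
# → {'Person': True, 'City': True}
```

Here the shared `foaf:name` template is ambiguous between the two types, but each type still has a unique `rdf:type` template, so each retains a signature.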

2.c) Similarly, another decision problem that is relevant and should be discussed is to check for any given context ctx and any given PG pg whether ctx is complete for pg (as per Def.19), because this is another requirement for the contexts used for the conversion. In fact, the latter requirement is not only relevant for the reversibility but even for irreversible conversions (Algo.1).


Elements of PGs (nodes and edges) and blank nodes are typically considered as separate types of concepts in the literature. In contrast, the presented approach assumes that nodes and edges of PGs may be blank nodes. Since this is an unusual assumption, it deserves some discussion. Are there any consequences of this assumption? Can the same blank node be used in multiple PGs? Can it even be used as a node in one PG and as an edge in another PG? What would that mean?


4.a) The paper uses a lot of different symbols and forms of notation. It is very hard for the reader to keep all of these things in mind. I strongly recommend the authors to try to reduce these things and to repeatedly give the reader clues throughout the paper. One example of how this can be done: Def.3 introduces the notion of a PG as a tuple that consists of six elements, but the symbols for these elements contain the symbol that denotes the PG as a whole. That makes the formulas that use these symbols very overloaded, and a related problem is that later parts of the paper do not even use the introduced form of a tuple. My proposal to fix these issues for this example (and the same approach can be adapted to many other aspects of the formalization as well) is to remove the subscripts from the symbols that denote the elements that are part of the tuple that defines a PG (i.e., "a property graph $g$ is a tuple $(N, E, \mathit{src}, \mathit{dest}, \mathit{labels}, \mathit{properties})"). Then, whenever a PG is mentioned later, it is introduced with the tuple that defines it (which makes it easier for the reader to remember where the individual symbols in the tuple come from). For instance, Def.4 should be written as follows: "The empty PG is a PG $p = (N,E,\mathit{src},\mathit{dest},\mathit{labels},\mathit{properties})$ for which it holds that $N=E=\emptyset$, ..." and Example 3 should be written as follows: "... can be captured formally by the tuple $g_\mathit{TT} = (N,E,\mathit{src},\mathit{dest},\mathit{labels},\mathit{properties})$ with $N= ...$ ..." In a context in which two different PGs are discussed (e.g., Def.6) the symbols of these two PGs can simply be distinguished by adding an apostrophe to the symbols of one of the two PGs.

4.b) Another good way to reduce the number of symbols and, at the same time, to make the text easier to read is to get rid of any symbol that, for some kind of things, denotes the set of all of these things. An example would be the symbol 'PGs' that denotes the set of all property graphs (line 7 on page 6). Instead of using symbols like this, the text should simply state explicitly what kind of thing a particular other symbol is meant to denote. For instance, instead of saying "∀(G, H) ∈ PGs^2, G and H are isomorphic iff [...]", Def.7 should be written as follows: "Two property graphs $G$ and $H$ are isomorphic iff [...]." There are more such "all-of" symbols that should be dropped (e.g., the symbols 'RdfTriples', 'BPGs', 'Ctx', 'Ctx_pg', Ctx^+').

4.c) Similarly, the overly aggressive use of the two quantification symbols (∀ and ∃) as a replacement for writing actual text makes the paper more cumbersome to read than it should be. For instance, Def.7 should better continue as follows: "... iff there exists a renaming function $\phi$ such that $\mathit{rename}(\phi,G)=H$."

4.d) Yet another issue that makes it more difficult than necessary for the readers to follow the formalization is that, for some notions, the authors use multiple different types of symbols throughout the paper. For instance, property graphs are sometimes denoted by G and sometimes by pg, template triples are sometimes denoted by tp and sometimes by t. This form of inconsistency must be avoided.

4.e) Finally, an aesthetic issue: For symbols used in the formalism, all symbols that consist of more than one character should be wrapped in a \mathit{..} command instead of writing them directly within the math environments. When writing them directly, LaTeX treats each character as a separate symbol and uses weird-looking spacing in some cases. As an example, consider the symbol 'Str' in line 45 of page 4 or the symbols 'subject', 'object' and 'RdfTriples' in Def.9.


5.a) When reading that the conversion is "without information loss" or "reversible", I would assume that the resulting RDF graphs contain all the information that is present in the converted PGs. However, this is not actually the case for the presented approach. That is, while the information captured by the properties in a PG is somehow captured in the resulting RDF graph, the information captured by means of the labels of nodes and edges does not need to be captured in the RDF graph. Instead, when recovering the PG from the RDF graph, the information about the labels is taken from the context that has been used. Hence, the context is an essential input for the reversion algorithm (and, thus, would need to be kept available in cases in which users want to be able to restore the original PG). I don't see this as a problem but as an interesting aspect of the approach, which should be made explicit in the paper.

5.b) Regarding reference [10], instead of citing the workshop version of the paper, please cite the more recent and improved journal version:

@article{lassila2023onegraph,
  author  = {Ora Lassila and Michael Schmidt and Olaf Hartig and
             Brad Bebee and Dave Bechberger and Willem Broekema and
             Ankesh Khandelwal and Kelvin Lawrence and
             Carlos{-}Manuel L{\'{o}}pez{-}Enr{\'{\i}}quez and
             Ronak Sharda and Bryan B. Thompson},
  title   = {The OneGraph Vision: Challenges of Breaking the Graph Model Lock-In},
  journal = {Semantic Web},
  volume  = {14},
  number  = {1},
  pages   = {125--134},
  year    = {2023},
  doi     = {10.3233/SW-223273}
}

5.c) On page 2, line 3, you state that "RDF-star [...] does not provide exactly the same modeling capabilities as PGs" and the reference that you use to back up this statement is the RDF-star CG report [9]. I don't see how this report can be used as a reference for this statement. The report does not talk about PGs at all, let alone compares RDF-star and PGs.

5.d) On page 2, lines 10-11, you write that "the conversion from PG to RDF [is] without information loss, so that users can modify their data and convert them back to the original PG model." This does not sound right. If the RDF data resulting from the conversion is later modified, why would users expect it to be possible to convert the modified RDF data back to the original PG? What would be the point of modifying the RDF view of the data if, after converting back, the PG is still the original one?

5.e) In the context of Def.13, I suggest to say explicitly that this notion of "keys" is not meant to be something like the notion of keys as a form of integrity constraints that can be defined in a database schema. The reason why I mention this is because the section is about PG schemas. In fact, after reading that title of the section plus the first sentence and then seeing the word "key", I immediately assumed that the form of schemas that is defined here enables users to specify some form of key/uniqueness constraints.

5.f) Line 51 on page 13 provides a (fairly vague) informal definition of a notion of "reversibility" of functions and, then, claims that this definition implies some properties that are described by two bullet points in lines 1-6 of page 14. I do not see how the second of these properties ("the inverse function [...] must be computable in reasonable time") is implied by the given informal definition of reversibility. This needs to be clarified. Also, what does "reasonable time" actually mean?

5.g) The second bullet point on page 14 continues with an example of public-key encryption functions for which it talks about "applying the encryption function on all possible outputs." I do not understand this statement. What "possible outputs" does it refer to? Is it possible outputs of the encryption function? But then why would the encryption function be applied on its possible outputs? Shouldn't it be a corresponding decryption function that is applied on the outputs of the encryption function?

5.h) Example 16 is wrong because, in contrast to the title of this example ("a trivially non reversible context"), ctx_\emptyset is in fact "reversible", at least in the sense that it satisfies the condition given in the paragraph before the example. The condition is satisfied as follows. For every BPG pg that is not the empty PG, the antecedent of the condition (i.e., ctx_\emptyset \in Ctx_pg) is not true and, thus, the condition is true. For the empty PG, the antecedent is true, but so is the consequent (because the prsc function / Algorithm 1 does indeed produce the empty RDF graph if given the empty PG).

5.i) At the beginning of Section 5.2.1, instead of directly putting the definition of this \kappa function without any further explanations, first the idea behind this \kappa function should be described informally in order to make it easier for the reader to understand what the formal definition aims to achieve.

5.j) While Example 17 illustrates an application of the \kappa function, none of the template triples in this example is nested. It would be helpful if there was also an example of \kappa being applied to a nested template triple. The reason for this is that it is not immediately obvious (and also not explicitly stated--see my previous point) that the template-triple version of \kappa (i.e., the second case in Def.22) is indeed recursive within itself. Providing an example with a nested template triple can make this fact more apparent to the readers (ideally in combination with the informal description of the idea of \kappa, as suggested in my previous point).

5.k) The definition of sign_ctx(type) in Def.24 is ambiguous. In particular, it is undefined what exactly sign_ctx(type) denotes in cases in which tps contains multiple template triples "that will produce triples that no other template in ctx can produce." For instance, what is sign_ctx(tn1) for the type tn1 in Example 18? A consequence of this ambiguity is that the use of sign_ctx(type) in the formula of Lemma 2 is unclear.

5.l) Algorithm 4, line 6: The purpose of the second condition (after the "or") is not clear to me. If the first condition (before the "or") is not true, then the second one cannot be true either. Hence, the second condition seems to be irrelevant.

5.m) Line 9 on page 24: It is not clear what you mean by "cut". Additionally, this part comes out of the blue; it is not clear at all what this has to do with the theorem that came directly before it or with the topic of the section. Hence, there needs to be some text that makes the connection.

5.n) Lines 11-14 on page 24: A fifth bullet point should be added to this list, saying: "The triples in the template graph ctx(typeof_pg(m))."

5.o) Line 24 on page 24, "Proof of the algorithm": State explicitly what algorithm this is about. Also, where is the actual proof?? (i.e., where does "here" refer to?)

5.p) Line 24 on page 25: There should be some text right before Lemma 4 that introduces the lemma (e.g., what is the purpose of the lemma/why is it relevant).

5.q) Lines 33-35 on page 28: The latter part of the sentence about our related work is not entirely true. The mapping in my AMAR2019 paper [22] is based on the notion of an LPG-to-RDF* configuration, and the idea behind this notion is also present in the GRADES2022 paper [23] where it is captured via the mapping functions pnm, lm, idm, and elm (but, admittedly, not really exploited to its full potential in the experiments of that paper). This notion of an LPG-to-RDF* configuration gives users the flexibility to model the resulting RDF/RDF-star graphs with different RDF vocabularies for different types of PGs. In this sense, it is also a user-configurable mapping, not so different from the approach in your manuscript!

Review #2
Anonymous submitted on 06/Jan/2024
Major Revision
Review Comment:

The article describes PRSC, a method to transform a property graph into an RDF graph.

General evaluation: The method presented in the article is interesting. However, the writing must be improved to reach the quality expected of a research article. Moreover, the authors should conduct an empirical evaluation to demonstrate the characteristics, advantages and disadvantages of the method.

(1) Originality: fair
The method is new with respect to the current methods for converting property graphs into RDF graphs. However, the use of mapping rules or mapping templates is not new in data management. Moreover, the method is restricted to basic conversions (e.g. the method does not allow complex conditions to be placed on the elements to be transformed).

(2) Significance of the results: poor
The article describes the method and studies the feature of reversibility, which is related to information loss. Although it is an interesting result, it is not hard to see that it is true because of the simplicity of the mapping rules supported by the method. In this sense, I was expecting an analysis of cases (including illustrative examples) in which reversibility does not apply, showing some features of such cases.
In terms of theoretical results, I was also expecting an analysis of other features, like complexity and expressiveness, including a comparison with other methods.
In terms of practical results, the article lacks an empirical evaluation showing the performance of the method in comparison with other methods.

(3) Quality of writing: very poor
The article is very hard to read for many reasons: there is an excessive number of definitions, many of them used only a couple of times in the article; there are confusing statements like "Let PG be a PG"; some definitions, algorithms and examples are not explained (e.g. Definition 21, Example 15, Algorithm 1) or their explanation is unclear.

Specific comments:

-- Page 1 --

"Property Graphs are not a uniform model: some implementations like Neo4j only allow exactly one label for each edge." What is the idea of this statement? It could be interesting to see an analysis of the property graph features supported by current graph database systems, and use such features to describe the requirements for a "good" method for transforming property graphs into RDF graphs.

"In this model, data are represented with triples that represent links between resources." From a data modeling point of view, it is better to use "relationship" instead of "link", and talk about "Web" resources.

-- Page 2 --

"... and more generally any Property Graph with the same schema into the corresponding RDF graph". There is no schema in the example.
I found a weak point here. Given the simplicity of the example, it is possible to create the mapping rules by "looking at" the graph. However, it would be very complicated to do that with a larger graph. I recommend the authors to include property graph schemas as an input to understand the structure of a property graph and create the mapping rules.

-- Page 3 --

Is it possible to create URIs instead of blank nodes?

How to deal with a multi-valued property? i.e. when the property-value is an array of values.

How to generate constants or new values?

-- Page 5 --

"Let E be a set, we recall that 2^E denotes the set of all parts of E." 2^E denotes the set of all subsets of E, including the empty set and E itself.

All the definitions related to functions can be reduced to a paragraph.

Definition 2 and Remark 2 are the same.

-- Page 6 --

Definition 6: in the definition of N_H, replace exists x by exists n.

-- Page 7 --

Definition 7: to say that G and H are isomorphic, the inverse mapping is also required, or not?

Definition 8: "L = Str x I". I should be replaced by the set of datatypes.

Check Definition 9: By definition, an RDF-star triple cannot contain itself and cannot be nested infinitely.

-- Page 10 --

Example 11: It requires a better explanation.

Definition 17: What exactly is being defined here?

Definition 18: The notion of "valid" template graph must be defined before. Improve the explanation.

-- Page 11 --

"The set of all ctx functions is noted Ctx". This type of notation makes the article difficult to understand.

Examples 12, 13 and 14 are neither explained nor referenced.

Example 15. This example must be explained.

-- Page 13 --

Algorithm 1: This algorithm must be explained with clarity and detail. It is not clear if this algorithm works with RDF-star.

-- Page 14 --

Example 16. This example is unclear. It does not help to understand the notion of reversible context.

Definition 22. Add an explanation. What are the input and output of function k?

-- Page 15 --

Lemma 1. The sketch of the proof is more a comment than a short explanation of the proof.

Definition 23 is unclear.

-- Page 16 --

Theorem 1:
Why is Theorem 1 important?
Ctx^+_pg is defined below the place where it is used.
What is the contradiction in the proof?

Definition 24: Improve the writing.

-- Page 17 --

The notion of well-behaved is interesting, however it must be presented and studied with more clarity.

Table 8 is very helpful for understanding the method. It could be extended to explain more features.

-- Page 18 --

Add a short explanation of the RDF graph shown in Listing 3.

It is not hard to see that the reversion algorithm is sound and complete, in particular for well-behaved contexts.
But, what happens with non well-behaved contexts?

-- Page 21 --

Theorem 3.
Why is this theorem important?

-- Page 22 --

Theorem 4.
What is the theorem? An initial description is required.
Why is this theorem important?

-- Page 23 --

Definition 26.
The explanation is enough to understand the notion of projection. The formal definition is unnecessary.

Lemma 3.
This is not relevant for the article.

-- Page 24 --

The statement in Theorem 5 is not a strong result to be presented as a theorem.

Table 11 should be explained.

"Here, we prove the correctness of the buildpg function in Algorithm 5."
Where is the proof? It must be presented with more clarity.

Fix: "The no value losscriterion"

-- Page 25 --

"into the graph g PG". Improve writing.

"Theorem 6. The PG returned by Algorithm 5 is pg."
This title is unclear.
This theorem is very important, so it must be explained and discussed with more clarity.


-- Page 29 --

Fix "RDG graph"

Discuss the advantages of PRSC with respect to the related work.

Review #3
Anonymous submitted on 06/Jan/2024
Minor Revision
Review Comment:

The paper introduces PRSC, a conversion tool designed to facilitate interoperability between Property Graphs (PGs) and RDF Graphs. The authors propose a configurable conversion approach that leverages PG schemas and user-defined mappings known as PRSC contexts. These contexts enable the transformation of PGs into RDF Graphs and vice versa, allowing for seamless data exchange between the two graph models. The paper outlines the key components of PRSC, including the conversion process and the formal proof that a subset of PRSC contexts can be utilized to reverse back to the original PG. Additionally, the authors discuss the benefits of using PRSC for converting between PGs and RDF Graphs, highlighting its potential to enhance data integration and interoperability in diverse application scenarios.

The motivation of the paper is not explicitly articulated, and there is room for improvement in clearly conveying the significance and relevance of the research. The paper lacks a strong and compelling narrative that effectively communicates the specific problem or challenge in the domain of graph database management that the PRSC tool aims to address. Additionally, the paper could benefit from a more explicit discussion of the broader implications and potential impact of the proposed solution.
To enhance the motivation of the paper, it would be beneficial to provide a more detailed and explicit discussion of the following aspects:
1. Identification of a Clear Problem Statement: The paper should clearly articulate the specific challenges or limitations in the current landscape of graph database management, particularly in the context of interoperability between Property Graphs and RDF Graphs.
2. Relevance and Significance: The motivation should emphasize the broader relevance and significance of addressing the identified problem. This could include discussing real-world scenarios or use cases where seamless interoperability between different graph models is crucial, and the potential impact of improved interoperability on data integration, knowledge representation, or application development.
3. Gap in Existing Solutions: The paper should explicitly highlight any existing limitations or gaps in current approaches to graph database interoperability, underscoring the need for a novel and effective solution such as PRSC. This would help to position the research within the context of existing literature and solutions.
4. Potential Benefits and Implications: The motivation should clearly outline the potential benefits and implications of the proposed PRSC tool, such as enabling cross-model data utilization, ensuring information preservation, and fostering broader adoption and collaboration within the graph database community.

The paper demonstrates originality in several key aspects:
1. Novel Approach to Interoperability: The paper introduces PRSC as a configurable conversion tool that leverages PG schemas and user-defined mappings to facilitate interoperability between Property Graphs and RDF Graphs. This approach represents a novel contribution to the field of graph database management, addressing the pressing need for seamless data exchange between different graph models.
2. Formal Proof of Reversibility: The paper provides a formal proof that a subset of PRSC contexts, termed well-behaved contexts, can be used to reverse back to the original PG without information loss. This emphasis on reversibility sets the PRSC approach apart from traditional conversion methods and contributes to its originality.
3. Open-Source Implementation: The paper mentions that the PRSC engine is available under the MIT license. This emphasis on open-source availability contributes to the reproducibility of the work, as it allows for broader scrutiny and potential adoption by the research and practitioner community.
4. Focus on Graph Database Interoperability: While there is existing research on graph databases and conversion methods, the specific focus of PRSC on enabling interoperability between Property Graphs and RDF Graphs represents a unique contribution to the field. By addressing the challenges of data exchange between these graph models, the paper offers original insights and solutions to a specific and relevant problem.

Significance of the results
The results presented in the paper hold significant implications for the field of graph database management and have broader relevance in the context of data interoperability and integration. The significance of the results can be understood through several key points:
1. Addressing Interoperability Challenges: The paper's results are significant as they directly address the challenge of interoperability between Property Graphs and RDF Graphs. By introducing PRSC as a configurable conversion tool, the paper offers a practical solution to facilitate seamless data exchange between these two graph models. This is particularly relevant in scenarios where organizations need to integrate data from diverse sources and systems.
2. Enabling Cross-Model Data Utilization: The results are significant in enabling users to leverage the strengths of both Property Graphs and RDF Graphs without being constrained by the limitations of a single model. This flexibility allows organizations to make use of a wider range of tools and technologies, thereby enhancing the utility and value of their graph data assets.
3. Reversibility and Information Preservation: The formal proof of reversibility for a subset of PRSC contexts is a significant result, as it provides assurance that the conversion from Property Graphs to RDF Graphs can be achieved without information loss. This is crucial for ensuring data integrity and maintaining the fidelity of the original graph data, which is essential in various domains such as data migration, data warehousing, and knowledge representation.
4. Future Research and Development: The paper's results also have significance in shaping future research and development efforts in the domain of graph database management. The proposed avenues for future work, such as extending PRSC's expressiveness and addressing scalability issues, provide a roadmap for further innovation and refinement of the PRSC conversion tool, thereby contributing to the advancement of graph database interoperability.

The soundness of the paper can be evaluated based on several key aspects:
1. Formal Definitions and Proof: The paper provides formal definitions of PGs and RDF graphs, as well as a formal definition of PRSC conversion. It also includes a formal proof that a subset of PRSC contexts, termed well-behaved contexts, can be used to reverse back to the original PG without information loss. This demonstrates a rigorous approach to establishing the soundness of the proposed conversion method.
2. Relevance to Interoperability: The paper addresses the need for interoperability between Property Graphs and RDF Graphs, which is a significant challenge in the field of graph databases. By introducing PRSC as a tool to facilitate seamless data exchange between these graph models, the paper addresses a relevant and pressing issue in the domain of graph database management.
3. Discussion of Limitations and Future Work: The paper acknowledges potential limitations of the PRSC approach, such as scalability issues for large PGs and the need for further expressiveness in PRSC contexts. By addressing these limitations and proposing avenues for future research, the paper demonstrates a comprehensive and self-aware approach to the topic, contributing to the overall soundness of the work.
4. Related Work: The current iteration touches on key studies, but a more comprehensive review of the literature could enhance the paper's depth and context. By broadening this section, the authors can better position their work within the existing body of research, highlighting both the advancements they are contributing and the gaps they are addressing. Moreover, an expanded Related Work section could offer a more detailed comparison with similar studies, thereby strengthening the paper's argument and its scientific significance.

The paper is generally clear and well-written, with several features that contribute to its clarity:
1. Organization and Structure: The paper is well-organized, with clear section headings and a logical flow of ideas. The introduction provides a clear overview of the problem and the proposed solution, while subsequent sections delve into the technical details of the PRSC conversion tool. The paper concludes with a summary of the key findings and future research directions. This structure helps readers to navigate the paper and understand the main points.
2. Definitions and Terminology: The paper provides clear definitions of key terms and concepts, such as Property Graphs, RDF Graphs, and PRSC contexts. The authors also use consistent terminology throughout the paper, which helps to avoid confusion and ensure clarity.
3. Examples and Illustrations: The paper includes several examples and illustrations to help readers understand the PRSC conversion process. For instance, Figure 1 provides a visual representation of the PRSC conversion process, while Table 1 shows an example of a PRSC context for converting a PG into an RDF Graph. These examples and illustrations help to clarify the technical details of the PRSC conversion tool.
4. Language and Style: The paper is written in clear and concise language, with technical terms and jargon explained in a straightforward manner. The authors also use a consistent style throughout the paper, which helps to maintain clarity and readability.