StarVers - Versioning and Timestamping RDF data by means of RDF* - An Approach based on Annotated Triples

Tracking #: 3097-4311

Authors: 
Filip Kovacevic
Fajar J. Ekaputra
Tomasz Miksa
Andreas Rauber

Responsible editor: 
Harald Sack

Submission type: 
Full Paper
Abstract: 
To foster reproducible and verifiable research results, the RDA Data Citation Working Group issued a set of 14 recommendations. The core of these recommendations revolves around preparing data storages so that arbitrary subsets of any evolving dataset can be efficiently identified and cited at any specific state or point in time. Based on these recommendations, we identified an efficient solution for RDF/triple stores, which so far have only offered cumbersome mechanisms to these ends. Our solution employs RDF* and SPARQL* to annotate data triples with temporal metadata and thereby allows for the retrieval of datasets as they were at a specific point in time. It furthermore relies solely on triple stores with RDF* and SPARQL* support, such as Jena TDB, GraphDB, Stardog and others, and thus does not require any Git-like versioning systems. We evaluate our work by employing the BEAR framework, using specific components such as the BEAR-B dataset (hourly DBpedia snapshots), its corresponding queries, Jena TDB as triple store, and the quads-based timestamp-based approach. We also extend the Java implementation of this framework with methods and functions for handling RDF* queries and with GraphDB as an additional triple store. Our BEAR extension is publicly available on GitHub [1]. [1] w3id.org/fkresearch/starvers
Tags: 
Reviewed

Decision/Status: 
Major Revision

Solicited Reviews:
Review #1
Anonymous submitted on 23/May/2022
Suggestion:
Minor Revision
Review Comment:

This paper presents an approach to version RDF data. The work's primary motivation is to address some of the RDA Data Citation Working Group requirements that have been defined to foster reproducible research. In particular, the main problem is to enable a mechanism to retrieve specific versions of a dataset at a particular point in time.

The solution proposed by the authors is to annotate RDF triples by using RDF*; moreover, to query the RDF*-based dataset, the authors propose to use SPARQL*.

The paper's main point is to define an annotation mechanism based on timestamps. The authors describe two ways to annotate triples (flat and hierarchical) and then show SPARQL queries to handle the data dynamics and convert existing non-versioned data into versioned data.
Then, an experimental evaluation is presented, which compares flat annotations, hierarchical annotations, and named graphs.
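
For readers who want to picture the two annotation styles, here is a minimal Turtle-star sketch of how they might look; the vers: namespace, the concrete timestamps (including the far-future sentinel), and the exact nesting of the hierarchical variant are assumptions for illustration and may differ from the scheme actually used in the paper.

```
@prefix :     <http://example.org/> .
@prefix vers: <http://example.org/vers#> .    # assumed namespace
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .

# Flat annotation: both timestamps attach directly to the data triple.
<< :Berlin :population "3677472" >>
    vers:valid_from  "2022-01-01T00:00:00+00:00"^^xsd:dateTime ;
    vers:valid_until "9999-12-31T00:00:00+00:00"^^xsd:dateTime .   # "until further notice"

# Hierarchical annotation: valid_until annotates the already-annotated triple.
<< << :Berlin :population "3677472" >>
      vers:valid_from "2022-01-01T00:00:00+00:00"^^xsd:dateTime >>
    vers:valid_until "9999-12-31T00:00:00+00:00"^^xsd:dateTime .
```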

Pros:
(i) The paper is well-written and easy to follow.
(ii) The overall organization looks good. Ditto for the explanation of the core idea.

Cons:
(i) I noticed that several references in the bibliography are not cited in the paper.
(ii) The experimental evaluation is minimal and does not allow an assessment of the approach's merits in terms of scalability. Being a journal submission, one would expect a more effective evaluation campaign.

Review #2
By Olaf Hartig submitted on 29/May/2022
Suggestion:
Reject
Review Comment:

The manuscript introduces an approach for timestamp-based versioning of RDF datasets. The approach uses RDF-star to annotate every triple of the dataset with a 'valid_from' timestamp and a 'valid_to' timestamp, where the latter may be in the far future to cover cases of triples that are "valid until further notice."

The authors describe how to insert and update triples with timestamp annotations by using SPARQL-star Update statements and they describe how to retrieve/materialize a version of the dataset at any given timestamp by using SPARQL-star queries. Additionally, as an evaluation of their approach, the authors have done a simple experiment by using a dataset of the BEAR RDF Archiving Benchmark with some triple pattern queries. The main observations of the experiment are i) that the tested systems (Jena TDB and GraphDB) achieve better query performance for some (unspecified) Named Graph representation of the timestamped data than for two variations of the authors' RDF-star approach and ii) that GraphDB achieves better performance than Jena TDB.

I do not consider the contributions of this manuscript sufficient for a journal article; in fact, I wouldn't even consider them sufficient for a conference paper in the main Semantic Web conferences. Instead, I would consider the contributions more as something for a workshop paper. The proposed approach is just a straightforward application of RDF-star and SPARQL-star, which is not even made transparent for users (instead, users are assumed to include the timestamp annotations and the timestamp-related query patterns manually), and the evaluation is very simplistic and only scratches the surface in terms of insights that it provides about the proposed approach. Moreover, the manuscript contains several small inaccuracies, and several details are missing or are not captured thoroughly.

Having said that, I am happy to see that the authors are attempting this work and I strongly encourage them to expand what they currently have. The remainder of this review elaborates more on the aforementioned issues and provides suggestions to improve and expand this work.

CONCEPTUAL CONTRIBUTIONS

To provide an actual conceptual contribution related to the proposed application of RDF-star and SPARQL-star, I would like to see a well-defined approach to make the RDF-star-based timestamping of RDF datasets transparent to the users. That is, I would like to see a foundation for *automatically* translating any given SPARQL Update statement into a SPARQL-star Update statement that adds or updates the relevant timestamp annotations. Similarly, for materialization-related queries, I would like to see a foundation for automatically translating a SPARQL query, together with a timestamp, into a SPARQL-star query that produces the result of the given SPARQL query over the version of the dataset at the given timestamp.
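
To make this concrete, a hand-written sketch of the rewriting asked for here could look as follows; the vers: namespace and the example IRIs are assumptions, and the rewrite shown illustrates the reviewer's suggestion rather than the authors' implementation.

```
PREFIX dbr:  <http://dbpedia.org/resource/>
PREFIX dbo:  <http://dbpedia.org/ontology/>
PREFIX vers: <http://example.org/vers#>    # assumed namespace
PREFIX xsd:  <http://www.w3.org/2001/XMLSchema#>

# Original, version-agnostic query:
#   SELECT ?pop WHERE { dbr:Berlin dbo:populationTotal ?pop . }
#
# The same query rewritten for the version valid at 2022-06-01T00:00:00:
SELECT ?pop WHERE {
  << dbr:Berlin dbo:populationTotal ?pop >>
      vers:valid_from  ?vf ;
      vers:valid_until ?vu .
  FILTER (?vf <= "2022-06-01T00:00:00+00:00"^^xsd:dateTime &&
          ?vu >  "2022-06-01T00:00:00+00:00"^^xsd:dateTime)
}
```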

In addition to materialization-related queries, I would like to see a thorough discussion of how the proposed application of RDF-star can be leveraged for other types of data archive queries (timestamp retrieval, delta materialization, cross-version queries, etc.). Related to that, I see that the authors wanted to focus on materialization queries, but I don't see any clearly-stated rationale for this focus; there are only some vague references to recommendations of a data citation working group, with no concrete elaboration on these recommendations and no discussion of the relevant requirements.
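
As one example of such a query type, a delta-materialization query under the same assumed annotation scheme might look roughly like this (the vers: namespace and timestamps are again assumptions):

```
PREFIX vers: <http://example.org/vers#>    # assumed namespace
PREFIX xsd:  <http://www.w3.org/2001/XMLSchema#>

# Triples added between t1 and t2 (and still valid at t2).
SELECT ?s ?p ?o WHERE {
  << ?s ?p ?o >> vers:valid_from  ?vf ;
                 vers:valid_until ?vu .
  FILTER (?vf >  "2022-05-01T00:00:00+00:00"^^xsd:dateTime &&   # t1
          ?vf <= "2022-06-01T00:00:00+00:00"^^xsd:dateTime &&   # t2
          ?vu >  "2022-06-01T00:00:00+00:00"^^xsd:dateTime)
}
# A symmetric query over vers:valid_until would yield the deleted triples.
```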

Smaller issues about the description of the proposal:

* While the idea to use the VALUES feature for inserting triples makes sense, I am certain that there is a practical limit to this idea. I mean, it is not possible to bulk load an unlimited number of triples in a single insert statement (a sketch of such a VALUES-based insert is given after this list). This limit should be discussed and may also be something to be studied experimentally.

* The examples for the case of updates focus only on updating a single triple. It may not be obvious to the reader how the updates are done when updating a combination of multiple triples or when bulk-updating multiple individual triples. Generally, as mentioned above, I would like to see a more generic treatment of how update statements need to be extended with the relevant timestamp-related patterns.

* The proposal for outdating a triple (Sec.3.6) requires that "an artificial valid_until timestamp must exist on that triple." While this requirement makes sense, there should also be a statement that specifies what happens, or what should happen, in cases in which the requirement is not satisfied.

* Section 3.7 claims that Tables 5 and 6 represent query results for the queries in Listings 9 and 10. That is incorrect because the values in the "Predicate" and the "Object" columns of the tables are not returned by the queries.

* The discussion related to DISTINCT at the end of Sec.3.7 does not make sense. Since the queries also contain the second FILTER condition (about ?valid_until), there would be no duplicates and no need for using DISTINCT (at least, if we assume that every data triple has only one vers:valid_from annotation and only one vers:valid_until annotation).

* In this discussion related to DISTINCT, the text mentions a condition with some "system_timestamp". It is not clear what that means.
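
Regarding the VALUES-based insertion mentioned in the first bullet above, a minimal sketch of such an insert statement is given below; the vers: namespace, the example resources, and the far-future sentinel are assumptions, so this illustrates the idea rather than reproducing the authors' exact statement.

```
PREFIX vers: <http://example.org/vers#>    # assumed namespace
PREFIX xsd:  <http://www.w3.org/2001/XMLSchema#>

INSERT {
  << ?s ?p ?o >>
      vers:valid_from  ?now ;
      vers:valid_until "9999-12-31T00:00:00+00:00"^^xsd:dateTime .
}
WHERE {
  # Each VALUES row becomes one timestamped (annotated) triple;
  # very large VALUES blocks are where the practical limit would show up.
  VALUES (?s ?p ?o) {
    (<http://example.org/Berlin> <http://example.org/population> "3677472")
    (<http://example.org/Vienna> <http://example.org/population> "1920949")
  }
  BIND (NOW() AS ?now)
}
```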

EVALUATION

From a journal article I am expecting a much more comprehensive evaluation than what is provided in this manuscript.

* There is no study of the performance impact of the insert and update part of the proposal, although this part makes up 2/3 of the description of the proposal.

* The file-import part of the evaluation is somewhat unclear (and perhaps misleading?) because the baseline (some Named Graphs based approach) is not clearly defined.

* The evaluation is based on a single dataset and, additionally, there is no justification why only that dataset was used (considering that the BEAR benchmark consists of multiple datasets).

* The triple pattern lookup queries considered in the "Query Performance" experiment are a very simple form of queries. The authors do not provide any consideration of the practical relevance of such queries; how much can we actually learn about the approaches from such simple queries?

* There is no discussion whatsoever of the observations that can be made from the measurements. Why is there such a huge reduction of the file size and the memory footprint when converting the Named Graphs representation into the RDF-star-based representations? Why do the systems achieve a better query performance for the Named Graphs representation of the timestamped data when compared to the RDF-star-based representations? Why does GraphDB achieve a better performance than Jena TDB? Is the hypothesis that "we expect a better performance with former ones" (i.e., predicate-lookup queries) actually verified by the experiments? etc.

* The claim that "using the proposed approach, even large scale and highly dynamic RDF datasets can be efficiently versioned" is absolutely not justified by the presented evaluation!
--> A few hundred MB are not "large scale" and neither are a few GB.
--> Also, there is nothing about "highly dynamic RDF datasets" in this evaluation; the authors just imported a file in which multiple dataset versions are represented. (see also my comment above about the lack of an evaluation of the insert and update part of the proposal)

Further smaller issues in the section about the evaluation:

* It needs to be specified which version of each of the systems was used exactly.

* Readers may not know what a ".ttl file" is and how it may be used to serialize datasets with RDF-star triples. In fact, the last paragraph of Sec.5 makes even me wonder what exactly the authors have done ("Once the turtle-star RDF serialization format becomes widely adopted we will fit our datasets into this format.") Does this mean the authors have not actually used Turtle-star for the serialization, but plain Turtle? How was it possible to represent the nested RDF-star triples then??

* What is the purpose of the "shell script" mentioned in Sec.4.1?

* The authors use the terms "compressed" and "compression" in several places, which is highly misleading because they have not actually used any compression techniques. Instead, there is simply a reduction of the dataset size (measured in terms of file size) after converting data from the Named Graphs representation to the RDF-star-based representations.

* It is not clear how to read things such as "173-256MB" in Sec.4.2.

* The authors should clarify what they mean by "storage consumption scaling factors".

* Table 7 says "mean ingestion time", but I don't see any indication of how many file-import runs were performed to calculate a mean.

RELATED WORK

The "Related Work" section needs to be improved as well. Currently, it appears as a semi-organized collection of some related work, mixed up with an introduction of the background of the presented work. Additionally, there is a paper that has two entries in the bibliography (namely, [2] and [17]), and there are entries for which it is not clear where these papers have been published (e.g., [3], [41], [42]).

Finally, I suggest also referencing the W3C Community Group Report about RDF-star and SPARQL-star, as this is the most recent document about the approach; in this context, I also suggest using the terms RDF-star and SPARQL-star instead of RDF* and SPARQL*.

Review #3
By Ruben Taelman submitted on 09/Jun/2022
Suggestion:
Major Revision
Review Comment:

This article introduces two methods for representing RDF archives via RDF*,
and corresponding methods for retrieving data for certain timestamps via SPARQL*.
The authors evaluate their approach using Jena and GraphDB within a modified version of the BEAR benchmark.
Results show that while storage size can be lower when using RDF* compared to named graphs,
the query execution time for RDF* is significantly higher than that for named graphs.
While the approach is interesting and novel, there are some issues with this work that need to be resolved,
which I will discuss below.

## Experiments are minimal

It is unclear why the authors have decided to only use the BEAR-B hourly dataset and queries.
However, the BEAR benchmark also includes BEAR-A, the other BEAR-B variants, and BEAR-C.
In order to evaluate the approach for different chokepoints (larger datasets, different queries, ...),
it would be good to also evaluate this approach on the other BEAR datasets.

Furthermore, while I understand that the authors introduce their solution as a TB approach,
it is unclear why they don't compare their approach with IC and CB systems, which are provided by the BEAR benchmark.
Currently, it is unclear how this approach positions itself relative to these other approaches.
Furthermore, including standalone RDF archiving systems (such as OSTRICH) within the comparison
would also make the evaluation a lot stronger.

## Use of DISTINCT keyword may be problematic

In section 3.7, the authors explain that version materialization requires the use of the DISTINCT keyword to cope with the fact that duplicate triples may exist in a special case.
This dependency on the DISTINCT keyword may be problematic for other cases where duplicate results can be produced, and where this duplication is desired by the end-user.
Are there workarounds possible for this?
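
One possible workaround, sketched below under the assumed vers: annotation scheme, is to confine DISTINCT to a subquery that covers only the version lookup, so that duplicates introduced by the timestamp metadata are removed locally while the multiplicity of the surrounding user query is preserved.

```
PREFIX vers: <http://example.org/vers#>    # assumed namespace
PREFIX xsd:  <http://www.w3.org/2001/XMLSchema#>

SELECT ?s ?p ?o WHERE {
  {
    # Deduplication is scoped to the version lookup only.
    SELECT DISTINCT ?s ?p ?o WHERE {
      << ?s ?p ?o >> vers:valid_from  ?vf ;
                     vers:valid_until ?vu .
      FILTER (?vf <= "2022-06-01T00:00:00+00:00"^^xsd:dateTime &&
              ?vu >  "2022-06-01T00:00:00+00:00"^^xsd:dateTime)
    }
  }
  # The rest of the user query joins here with its usual bag semantics,
  # unaffected by the inner DISTINCT.
}
```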

## Comparison to named graphs should be clarified

In the introduction, the authors say: "Last, we summarize our work and provide arguments why our RDF* approach to versioning RDF datasets should be preferred over named graphs as BEAR’s timestamp-based reference approach."
I was looking for this argumentation, but I could not find it.
I also wonder what this argumentation would be, since it looks to me like the named graphs approach is significantly faster than the RDF* approach.
This is also something that should be clarified upfront, since this fact only becomes clear at the end of the paper, while it would have been good to have known it at the start of the paper (even in the abstract).

## Minor issues

* The approach is similar to "Cuevas, Ignacio, and Aidan Hogan. "Versioned Queries over RDF Archives: All You Need is SPARQL?." MEPDaW@ ISWC. 2020."
But they use named graphs instead of nested triples. It would be interesting to discuss the differences with this work.
* Typo on page 6: "each tripe"
* Page 12: "We build our evaluation framework on top of OSTRICH which is a more stable fork of the original BEAR repository on Github and a synonym for a more recent publication featuring BEAR as evaluation framework"
OSTRICH is not a fork of the BEAR benchmark. OSTRICH is an RDF archiving solution that also makes use of the BEAR benchmark. The authors of OSTRICH just happened to create a fork of BEAR with some minor enhancements.
* The canonical citation for OSTRICH is "Triple Storage for Random-Access Versioned Querying of RDF Archives", instead of "OSTRICH: versioned random-access triple store".
* Possible typo on page 15: "< 250s per query" (should be ms?)
* "3.2. Representing timestamped RDF triples with RDF*"
An alternative representation that would have been interesting to investigate would be something like <> vers:validity [ vers:valid_from "..."; vers:valid_until "..." ].
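
Reading the <> above as a placeholder for the quoted data triple, this alternative might be written in Turtle-star roughly as follows (namespace and timestamps assumed); grouping both timestamps under one vers:validity node could, for instance, make it easier to attach several validity intervals to the same triple.

```
@prefix :     <http://example.org/> .
@prefix vers: <http://example.org/vers#> .    # assumed namespace
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .

<< :Berlin :population "3677472" >> vers:validity [
    vers:valid_from  "2022-01-01T00:00:00+00:00"^^xsd:dateTime ;
    vers:valid_until "9999-12-31T00:00:00+00:00"^^xsd:dateTime
] .
```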