Performing live time-traversal queries on RDF datasets

Tracking #: 2969-4183

Arcangelo Massari
Silvio Peroni

Responsible editor: 
Katja Hose

Submission type: 
Full Paper
This article introduces a methodology to perform live time-traversal queries on RDF datasets and software based on this procedure. It offers a solution to manage the provenance and change-tracking of entities described using RDF. Although these two aspects are crucial factors in ensuring verifiability and trust, some of the most prominent knowledge bases – including DBpedia, Wikidata, Yago, and the Dynamic Linked Data Observatory – do not support time-agnostic SPARQL queries, i.e. queries across the various statuses an entity may have assumed in time. The OpenCitations Data Model (OCDM) describes one possible way to track provenance and entities' changes in RDF datasets, and it allows restoring an entity to a specific status in time (i.e. a snapshot) by applying SPARQL update queries. The methodology and library presented in this article are based on the rationale introduced in the OCDM. We also develop benchmarks proving that such a procedure is efficient for specific queries and less efficient for others. To date, as far as we know, our library is the only software supporting all the time-related retrieval functionalities without pre-indexing data.
Full PDF Version: 


Solicited Reviews:
Click to Expand/Collapse
Review #1
Anonymous submitted on 04/Jan/2022
Review Comment:

The article proposes a methodology to execute live time-traversal SPARQL queries on standard RDF engines. A time-traversal query is a standard query defined on some elements of the history of a dataset, e.g., a revision or a changeset. In this paper, that history is modeled using the OpenCitation Data Model (OCDM) that builds upon the PROV data model, and defines a data revision every time an RDF resource is updated. The metadata for changesets and revisions is stored in named graphs, and the actual updates are stored as SPARQL update queries. The paper formalizes the execution of the different types of time-traversal SPARQL queries and elaborates on the design and implementation of a Python library designed to confer time-traversal semantics to SPARQL SELECT queries. An experimental evaluation shows the runtime and memory consumption of the proposed approach using BlazeGraph as storage engine on the Scientometrics dataset. The evaluation suggests reasonable runtimes for the different query types, and at the same time illustrates the need for more tailored query planning for such types of queries.

The topic addressed by the paper is a good match for the SWJ and tackles a central problem in the Semantic Web, namely, the lack of native support for time-traversal queries on ever-changing RDF data, which is of great value for data producers and maintainers. The paper assumes the use of OCDM and proposes this standard as an RDF-compliant data model to store the history of a dataset, and proposes an abstraction library that builds upon standard SPARQL to execute queries defined on the history of an RDF collection. While I believe this work constitutes a valuable contribution to the SW community, it still requires a lot of work to be publishable. I will elaborate on my recommendations in the following:

- The related work is comprehensive and pertinent. That said, the paper has disregarded a few citations, in particular those related to SPARQL extensions for time-travel queries:

** Shi Gao, Jiaqi Gu, and Carlo Zaniolo. RDF-TX: A Fast, User-Friendly System for Querying the History of RDF Knowledge Bases. In Proceedings of the Extended Semantic Web Conference, pages 269–280, 2016. doi:10.1016/j.ic.2017.08.012.

** Valeria Fionda, Melisachew Wudage Chekol, and Giuseppe Pirrò. Gize: A Time Warp in the Web of Data. In International Semantic Web Conference (Posters & Demos), volume 1690, 2016.

A comprehensive survey of RDF archiving can be found at:

** Olivier Pelgrin, Luis Galárraga, Katja Hose. Towards Fully-fledged Archiving for RDF Datasets. Semantic Web Journal, 2021.

I think the non-reliance on syntactic extensions for SPARQL is one of the strengths of the paper. However, I think the current version lacks a formalization of the semantics of the different query types. This is important since this work also provides a concrete implementation for those queries. I would encourage the authors to formalize the notions of time-aware dataset and time-travel query, and define rigorously what we expect as result when we run each query type.

The most important weakness of the paper's current version lie on its experimental evaluation, which is unfortunately too poor:

- It seems as the used benchmark is part of the paper's contribution. If so, it should be more comprehensive: Two SPARQL queries are simply not enough. While they cover the two general cases -- known/unknown subjects --, there are still some factors that can be varied in order to provide a more comprehensive evaluation: the predicates occurring the queries, the dynamicity of the chosen subject, the form of the triple patterns, etc. If more SPARQL queries are included, the paper should also provide a more detailed description of them (number of triple patterns, shape, type of predicates, etc).

- Are queries executed multiple times and their times averaged? (the standard deviation should be provided). What about reordering the execution of the queries (in this case, alternating between queries 1 and 2) to challenge the Blazegraph cache?

- Without a reference, the numbers in table 5 are completely uniformative. Perhaps a baseline for comparison could be the query runtime on the latest snapshot/change. This could provide a hint of the overhead induced by time-traversal, including a per-snapshot/per-update overhead, that is, the result of dividing the total runtime by the number of revisions/updates involved in the query execution. That would be more informative than an absolute number as it is shown now.

- While the previous recommendations would turn the current evaluation more informative, it is not enough from a scientific point of view. The paper should show that the overall approach (data model + proposed strategy) is viable for SW practitioners, and that brings something new w.r.t. the state of the art. I therefore believe that a broader comparison with existing solutions is mandatory. This could include either an evaluation of the approach using another graph store, e.g., Virtuoso, Fuseki, (I understand it may be tricky since not all solutions have sophisticated text search functionalities, an alternative could be QLever or a comparison with OSTRICH (limited to the queries when this is possible).

This would certainly change the content of the discussion section.

A few typos:

- Page 5, "having discussed RDF extension" -> extensions
- Page 10, 10.1111/j.1365 2648.2012.06023.x -> no difference, missing trailing period

Review #2
By Ruben Taelman submitted on 05/Jan/2022
Major Revision
Review Comment:

This article introduces a document-oriented approach for representing RDF archives using a change-based storage approach,
where diffs on entities are stored as SPARQL update queries.
On top of this storage approach, the authors introduce algorithms for querying such archives.
The implementation of this approach is available as an open-source library,
and its performance is evaluated using a custom benchmarks, which appears to be fully reproducible.
Results show that queries for a single entity and version are significantly faster than queries for single or variable entities across all versions.

While I consider this work very relevant, original, well-written and interesting, there are several problems with it in its current state, which will be listed below.

## 1. Unclear query algorithms

My primary concern with this work is with the discussion around the query approaches.
Each query approach is currently explained in the text and with a flow chart.
Nevertheless, there seem to be many details missing in order to precisely understand how each algorithm works.
I think explaining the algorithms together with detailed pseudocode might help solve this problem.
This could also form the basis for a proof of correctness, which also appears to be missing at the moment.

For example, it is unclear how the delta materialization algorithm works.
The authors say that "no operations are needed" for this due to the way the data is stored.
But it is unclear then what needs to happen exactly for the results to be formed.
For instance, I would expect some kind of merge operation to occur when a resource has multiple snapshots/update queries.

## 2. Missing details regarding SPARQL update queries

The storage approach is missing details around the way SPARQL update queries are stored.
Based on the examples, it seems like only INSERT DATA and DELETE DATA queries are allowed, and update queries with a WHERE clause are not.
Furthermore, it seems like certain operations depend on string matching of URLs, which requires no prefixes to be used within queries.
It is also unclear if blank nodes are allowed.
It seems crucial for these kinds of details around the storage approach to be formalized.

## 3. No comparison with other systems

It is unclear why the authors did not include a comparison of their approach with other RDF archiving systems,
and made use of existing RDF archiving benchmarks such as BEAR [1].

Since such a comparison is missing, it is difficult to understand the cost-benefit of this approach wrt to other approaches that rely on pre-indexing.
For instance, the authors make the following claim in the discussion section: "Regarding structured queries, they are swift if all subjects are known or deductible by explicating the variables recursively in linked triple patterns."
This claim is hard to prove without some kind of comparison with existing systems to place "swiftness" in the proper context.

## 4. Incomplete evaluation

The evaluation with known subjects seems to be hardcoded to always run on the entity with URL ending with "86766".
As far as I understand, the dataset contains more than a million entities, so it is unclear why the authors are only running experiments for a single one.
Based on the presented results, there is no way of knowing of this single entity produces results that are representative for all entities.
For instance, it is unclear how many versions this entity has, and how performance will be for entities with more or less versions.
To obtain a more complete view on this, I would recommend the authors to include much more entities in the experiment.

## 5. Usage of retrieval functionality terminology

In section 2.3, the authors talk about the retrieval functionality categorization from Fernández, Polleres and Umbrich.
More recent RDF archiving papers make use of the more low-level "foundational query atoms",
which are considered more fundamental for covering all retrieval demands.
It's unclear why the authors have positioned this work using this older categorization.
In any case, it would be good if the authors would explain how their work is positioned wrt these query atoms,
as this enables better comparison with other systems.

The usage of this other terminology is probably the cause of a mistake in table 5 (which may fit better in the related work section) regarding the available functionality in other systems.
For instance, the OSTRICH system supports all retrieval functionality (but not live indeed).

## 6. Confusion regarding lack of pre-indexing

The paper mentions in several places that it does not require pre-indexing, such as on page 30: "Therefore, to date, as far as we know, time-agnostic library is the only software to support all retrieval functionalities without requiring pre-indexing processes."
As far as I understand, the time-agnostic library in fact *does* require pre-indexing of the latest version, which is stored in a triple store (e.g. Blazegraph), which does make use of indexes.
Furthermore, the requirement to attach SPARQL update queries on entities, and the usage of the Blazegraph text index also forms an indexing process.
Therefore, it seems imprecise to consider this work as not requiring pre-indexing.

## Minor issues

* Typo: page 5: "49% less triple than RDF Reification"
* Incorrect OSTRICH citation, should be [2].
* Page 9: "OSTRICH supports version materialization, delta materialization, and single-version queries." Following the used terminology in this paper, it also supports the structured query variants.
* Typo: page 18: "does not speed the relevant entities’ discovery"
* Are there any measurements of how large the cache can become?
* It would be nice if the time-agnostic-browser would run on a web server, so that readers can use it without having to setup the system locally.
* The proposed approach seems be related to TailR [3] and Quit store [4], so it would be good to include them in related work.
* Does this approach support the full SPARQL 1.1 syntax? E.g., the authors talk about handling triple patterns, but not property paths, so it is unclear whether or not these are supported as well.
* Page 13: "Since deltas are saved as SPARQL strings, a textual search on all available deltas can be executed to find those containing the known URIs." This is not always guaranteed to succeed. For example, if IRIs are prefixed differently, or if a BASE is used in the query. To ensure correctness, each query should be parsed, and comparison should be done over the abstract syntax tree.

## References

[1] Fernández, Javier D., et al. "Evaluating query and storage strategies for RDF archives." Semantic Web 10.2 (2019): 247-291.
[2] Taelman, Ruben, et al. "Triple storage for random-access versioned querying of RDF archives." Journal of Web Semantics 54 (2019): 4-28.
[3] Meinhardt, Paul, Magnus Knuth, and Harald Sack. "TailR: a platform for preserving history on the web of data." Proceedings of the 11th International Conference on Semantic Systems. 2015.
[4] Arndt, Natanael, et al. "Decentralized collaborative knowledge management using git." Journal of Web Semantics 54 (2019): 29-47.

Review #3
Anonymous submitted on 30/Jan/2022
Major Revision
Review Comment:

Performing live time-traversal queries on RDF datasets
Arcangelo Massari and Silvio Peroni

The paper focus on querying datasets over time (time-traversal queries) ie. taking into account that dataset changes over time. A basic use-case is to be able to query a dataset at a given time (ex: lectures given by a certain teacher at time ti), or to query over time (ex: people that have been student and teacher for the same course).

Querying a dataset over time is a well-known topic in semantic web, with an important state of the art.

The paper states that “all existing solutions need to preprocess and index RDF data to work efficiently”, and that this preprocessing is not tractable with datasets that change frequently. On the other hand, “live system” (those that do not require pre-processing), only provides materialisation of states or delta without query processing integration.

Well, I need more evidence (with complexities/numbers) that pre-processing//indexing makes “existing solution” not tractable in presence of frequent changes.

Next, the paper proposes an approach based on the Open Citation Data Model (OCDM) to combine data and histories (following a change-based policy) that allow single-version, cross-version, single-delta, cross-delta structured query processing. The paper evaluates the performance of the proposal on an OCDM dataset with some queries.

Strong points:
* I agree that proposing a working system allowing all kinds of time-traversal queries is a valuable contribution.
* The “live” constraint for executing time-traversal is interesting
* The system has been experimented
Weak points:
* Some inconsistencies in the structure of the paper (see table 5 remarks below)
* Limitations of state of art needs more evidences
* Related works need more precision.
* Lacks of definitions/algorithms/complexities in proposal.
* All kinds of queries are not supported with the same performance. Some non-selective queries can be not tractable.
* Experimentation has no baseline, and use no benchmark.

Section introduction. I think that the Introduction is not very convincing. Limitations of state of the art are quickly/poorly described, to the detriment of a long description of OCDM (that could be described more precisely later).

Related Works:

I’m quite surprised to not see some examples of different types of time-traversal queries as described in [44] (just to explain the context) ie [44] is cited, but I had to read [44] again to really understand this paper.

How is the OCDM storage model positioned compared to [31] and R&WBase? Is it similar?

The semantic web journal published [TFF20], but this publication is not cited (i’m not an author of this paper).

Table 1 is interesting, but how is the “scalable column” decided? I don’t see why RDF* is not scalable. Can you give evidence that RDF* does not scale? Also concerning the standard, i agree that RDF* is currently not a standard, but there is a W3C working group, with a first draft published and quite solid implementations available. Is it possible to represent the OCDM model in RDF* ??[a]

I really do not understand why Table 5 is at the end of the paper and not in related works. In fact table 5 is somehow described in the text of section 2.3 and table 5 resumes everything at the end of the paper.

As it is written, at the end of the related works, we don’t really know what is the scientific problem to solve. A strong restructuration of the paper is needed to leave the related with a crystal clear problem to solve.

Section 3 describes the proposal.

The model is not easy to understand. Many drawings are provided (fig3), but finally, the reader has to read listing 1 to see really where the first “hasUpdateQuery” is located. So examples are quite difficult to follow. For example, it requires time to understand the se1/se2 contains the history of the entity. (I think there is a Typo P10 line 38 “10.1111/j.1365 2648.2012.06023.x” instead of “10.1111/j.1365 2648.2012.06023.x” ).

Section 3.1 presents how it is possible to materialise Version and Delta from the OCDM model. I’m not a big fan of flow charts to describe how to do things. It lacks definitions, it lacks precision, and at the end it is nearly impossible to estimate complexities of operations. For these operations, is the proposal better than competitors (PromptDiff, SemVersion, R&WBase) at the algorithmical level?

Section 3.2 makes the lack of definition/algo more problematic. What is precisely an isolated triple pattern? What is an explicit solvable variable? As definitions are not present, it takes a lot of time (reading and re-reading) to really understand how the approach works. The same remark with flow charts also applies here.

I understood that the main issue with this approach is to materialise only the required entities at the right point in time during query evaluation. Of course the approach works well, if there is a root entity to start from. If not, the number of entities to materialise can be in the worst case the whole dataset. Such complexities are important to understand the limitation of the approach and, as it is not present, it is a serious drawback for the paper.

Again, on the algorithmic part, how can this approach be compared with v-RDFCSA in terms of complexities ?? It seems to me that the proposed approach can be catastrophic if non-selective queries.

For me, Section 3 lacks definition, algorithms and complexity analysis. This is required to understand the limitations of the approach compared to baseline. As stated by the authors, the approach should work well when at least a subject is a literal leading to a selective query.

Section4 describes the implementation. It looks like a user manual and is quite boring to read. We understand that the cache is important for the library, but it lacks, at least, a simple formalisation to understand how the cache works.

Section 5 describes the evaluation of the proposal. There are several problems with this evaluation. First, we don’t know the questions that the evaluation should answer. What is expected from this experimentation ?

Second, there is no baseline. According to table 5, for materialisation of all version/single version, PromptDiff,SemVersion/R&WBase are competitors. So the proposal is faster or slower? Even if OSTRICH/v-RDFCSA do not support the “Live” setup. On this setup, it was possible to compare performances of both approaches.

Third, there is no real benchmark. How can I know if proposed queries are challenging? In [TFF20], several benchmarks are cited as BEAR, some experiments are also conducted. How your system is positioned?

For me, the results of the evaluation part are not very useful as it is presented, because there is no baseline to compare with.

Section 6 presents a discussion.

Some of the limitations of the proposal are presented. Although the system is able to support all kind of time-traversal queries, good performances is achieved only for a small class of queries ie. queries with literal as subject. This reduces significantly the usefulness of the proposal.

I really do understand why table 5 appears only in section 6. For me, it should be moved to related work to present the positioning of the proposal vs the state of the art.

Overall, the paper has some good elements (supporting all kinds of time-traversal queries), but, as it is, the paper is closer to a resource paper than a research paper. If authors prefer writing a “resource paper”, better description of potential use-case is required.

On originality, the paper highlight the constraint of “live” time-traversal queries ie. querying without pre-processing. But this constraint needs to be to be better motivated.

On significance of the results, the lack of complexities when describing algorithms, and baselines during experimentations is really problematic.

On the quality of writing, the paper suffers from significant anomalies in the structure.