Expressive Querying and Scalable Management of Large RDF Archives

Tracking #: 3625-4839

Olivier Pelgrin
Ruben Taelman
Luis Galárraga
Katja Hose

Responsible editor: 
Aidan Hogan

Submission type: 
Full Paper
The proliferation of large and ever-growing RDF datasets has sparked a need for robust and performant RDF archiving systems. To tackle this challenge, several solutions have been proposed over the years, including archiving systems based on independent copies, time-based indexes, and change-based approaches. In recent years, modern solutions have combined several of the above-mentioned paradigms. In particular, aggregated changesets of time-annotated triples have shown a noteworthy ability to handle and query relatively large RDF archives. However, such approaches still suffer from scalability issues, notably at ingestion time, which makes them prohibitive for large revision histories. Furthermore, applications of such systems often remain constrained by their limited querying abilities, where SPARQL is often left out in favor of single-triple-pattern queries. In this paper, we propose a hybrid storage approach based on aggregated changesets, snapshots, and multiple delta chains that additionally provides full SPARQL querying on RDF archives. This is achieved by interfacing our system with a modified SPARQL query engine. We evaluate our system with different snapshot creation strategies on the BEAR benchmark for RDF archives and showcase improvements of up to one order of magnitude in ingestion speed compared to state-of-the-art approaches, while maintaining competitive querying performance. Furthermore, we demonstrate our SPARQL query processing capabilities on the BEAR-C variant of BEAR. This is, to the best of our knowledge, the first openly-available endeavor that provides full SPARQL querying on RDF archives.

Major Revision

Solicited Reviews:
Review #1
By Guillermo de Bernardo submitted on 21/Mar/2024
Major Revision
Review Comment:


The authors present a number of significant contributions, expanding previous work in the representation of RDF archives to improve performance and add new query capabilities. The key idea in the article is to improve OSTRICH, a solution based on aggregated delta chains. First, the authors describe strategies to use multiple delta chains, potentially improving space requirements, ingestion times and query times. Additionally, the authors improve previous work by introducing a new encoding schema that reduces disk usage and ingestion time. Finally, the authors describe how to enhance the system to allow SPARQL queries over the RDF archive. The experimental evaluation expands previous work by studying the multiple-delta-chain strategy in all the datasets of the BEAR benchmark, and tests the performance of the new contributions.

The paper is well written and the problem is worthy of attention. The combination of all the contributions yields several steps forward in the context of RDF archives, and significantly improves the baseline OSTRICH tool in many scenarios. In terms of query performance, I find that the overall relevance of the proposal (i.e., its position in the state of the art) is not easy to evaluate without additional discussion or results. On the other hand, the improvement over the baseline is promising, and the experimental evaluation with the BEAR-C benchmark is a significant contribution that provides a baseline for future work.

Compliance with open science data

The authors provide several links to resources that can be used to partially reproduce the experimental evaluation:
- The Zenodo dataset linked in the article is well organized and easy to use, but it appears to be the same file used in previous work [13]. It provides code and instructions to reproduce the experiments in Section 8.3, and most of the results in Section 8.2.
- There are no instructions on how to reproduce the experiments in Section 8.4, and I did not find any obvious way to switch between metadata representations in the linked resources.
- Several sources are linked in relation to SPARQL support, covered in Section 8.5. The contents of the Github repositories should be enough to run experiments using SPARQL, but there is no specific information on how to reproduce the experiments in the article.

Following the journal guidelines, a single file should be provided, including all the necessary resources. I would suggest creating a new Zenodo file, following a similar structure to the previous Zenodo link, to include the essential information to reproduce all the experiments.

Main comments: suggestions in the experiments section

I find that the use of OSTRICH as the only baseline makes it difficult to assess the significance of the results in relation to query performance. The authors achieve better space requirements, ingestion times, and query times than the OSTRICH baseline. However, they omit any comparison with the reference systems included with BEAR, with the claim that they are outperformed by OSTRICH. The results in [17], however, require a more nuanced reading: OSTRICH obtained good results and offered a good tradeoff, but it was neither the smallest nor the fastest alternative. This leaves several open questions hidden behind the claim that OSTRICH outperforms the BEAR baselines:
- Would the multiple-snapshot strategy on top of OSTRICH be faster than the BEAR reference systems at ingestion time? The comparison with other alternatives may not be completely fair, as in [17], but the improvement of one order of magnitude compared to the OSTRICH baseline is not so relevant if the system is still slower than the alternatives.
- Would the multiple-snapshot strategy be faster (for VM, DM queries) than the HDT-based systems? Would it be faster for V queries than the BEAR baselines? It will surely be more competitive, but it may not outperform at least the HDT-based alternatives.
- Would the multiple-snapshot strategy use less space than the alternatives? Probably yes, except for HDT-CB.
A full comparison with the reference systems included with BEAR may not be necessary to demonstrate most of the current improvements, but a limited experimental evaluation could easily answer the previous questions. At a minimum, some additional discussion is necessary to place the contributions of this work within the state of the art in terms of performance for triple-pattern queries, or to justify the omission of the BEAR reference systems as baselines.

Section 8.4 only displays results of the compressed representation for BEAR-B. The improvement is impressive, especially at ingestion time, so it would be useful to know whether a similar improvement can also be expected on BEAR-A (if reasonable to compute) or BEAR-C, which have few versions and bulky delta chains. BEAR-C ingestion times and disk usage with the compressed representation are relevant since it is used in Section 8.5, and I assume that the results in Table 6 are obtained from the version with uncompressed metadata.

Other comments

The third contribution described on page 2 mentions an extended evaluation of previous work with additional baselines. However, the only additions in Section 8.3, compared to [13], seem to be the BEAR-A baseline and the evaluation of ingestion time and disk usage on BEAR-C.

The choice of keeping snapshots also as aggregated deltas is briefly discussed in Section 4.1. This is used to improve DM queries between consecutive snapshots, but it should also require extra disk space and ingestion time. Are there other practical advantages of this choice? I would suggest discussing the benefits of the optimization in the experimental evaluation.

In Section 4.2, the change-ratio strategy computes the sum of change ratios for all the versions in the current delta chain to determine whether to build a new snapshot. The explanation in p.7, l.25 implies that this calculation is used to estimate the amount of data not materialized in the snapshot. However, the value of $\delta_{s,k}$ is already a good estimation of this; the same could be said of aggregating the relative deltas $\delta_{i,i+1}$ for all snapshots. If the goal is to more accurately estimate the redundancy in the delta chain, this should be explained. As a side note, it is not clear whether the same metric (or at least the same value of $\gamma$) should be equally effective before and after changing the metadata encoding as described in Section 6.
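The accumulation described above can be sketched in a few lines. The function names, the exact definition of the ratio, and the threshold semantics are assumptions for illustration only, not the paper's actual implementation:

```python
# Sketch of the change-ratio snapshot policy as described in Section 4.2.
# All names and the exact ratio definition are hypothetical.

def change_ratio(delta_size: int, snapshot_size: int) -> float:
    """Relative amount of change w.r.t. the snapshot starting the chain."""
    return delta_size / snapshot_size

def should_create_snapshot(delta_sizes: list[int],
                           snapshot_size: int,
                           gamma: float) -> bool:
    """Create a new snapshot once the accumulated change ratio of all
    versions in the current delta chain exceeds the threshold gamma."""
    total = sum(change_ratio(d, snapshot_size) for d in delta_sizes)
    return total > gamma

# Example: a 1000-triple snapshot followed by three deltas of growing size.
# Accumulated ratio is 0.05 + 0.12 + 0.30 = 0.47, exceeding gamma = 0.4.
print(should_create_snapshot([50, 120, 300], 1000, gamma=0.4))  # True
```

Note that because every version's ratio is measured against the same snapshot, the sum keeps growing with the chain length even if individual deltas stay small, which is what makes the open question about redundancy estimation relevant.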

In Table 2, the notation for the first two rows is confusing: $u_{i,j}^+$ and $u_{i,j}^-$ should be $u_{k}^+$ and $u_{k}^-$.

In Table 2 it is assumed that the change set sizes can be added up directly to compute the deltas. I suggest that you provide an example of the delta calculation, or simply state that changesets are independent.
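A minimal sketch of the kind of example that could be added, showing why changeset sizes can only be summed directly when changesets are independent (all names and triples below are hypothetical):

```python
# Sketch: summing changeset sizes overcounts when changesets overlap,
# i.e., when a triple added in one version is deleted in a later one.
# Triples are modeled as plain strings; all names are hypothetical.

u1_add = {"t1", "t2"}          # additions in version 1
u1_del = set()                 # deletions in version 1
u2_add = {"t3"}                # additions in version 2
u2_del = {"t2"}                # version 2 deletes a triple added in version 1

# Naive size: simply add up the individual changeset sizes.
naive_size = len(u1_add) + len(u1_del) + len(u2_add) + len(u2_del)

# Actual aggregated changeset: "t2" cancels out and appears in neither set.
agg_add = (u1_add | u2_add) - (u1_del | u2_del)
agg_del = (u1_del | u2_del) - (u1_add | u2_add)
actual_size = len(agg_add) + len(agg_del)

print(naive_size, actual_size)  # 4 2 -- the sums differ when changesets overlap
```

When changesets are independent (no triple occurs in both an addition and a deletion set), the two quantities coincide, which is the condition the paper should state explicitly.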

Algorithms 1, 2, and 4 provide limited added value, since they could be summarized as the OSTRICH algorithms applied to one of the many delta chains. In any case, the pseudocode of Algorithms 1 and 2 may be useful to make the article self-contained, as well as to introduce the notation.

A few small typos in the first algorithms:
- Alg. 2, line 9: replace $u_j^+$ by $u_j^-$
- Alg. 2, line 11: replace the second $u_j^-$ by $u_i^-$
- Comments in Algorithms 1 and 2: "correspond" -> "corresponds"

In Algorithm 3, l. 15, $\delta$ should be $(u_i, u_j)$. I would suggest using the same notation inside snapshotDiff instead of referring to a generic $\delta$, so that the notation is also consistent with Algorithm 2.

The explanation of Algorithm 4 is incomplete: it is stated that the algorithm iterates over the triples that match p in the snapshot, but the additions are not mentioned. The behavior of "queryAdditions" should be included in the explanation, and the description of line 6 should be adjusted.

Other small details found in the text:

p. 5, l. 31: "a HDT" -> "an HDT"
p. 8, l. 44: "correspond" -> "corresponds"
p. 11, l. 24: "describe" -> "describes"
p. 12, l. 3: "paramount to functioning" -> "paramount to the functioning" (?)
p. 12, l. 4: "the multiple aspects" -> "multiple aspects" (?)
p. 13, l. 17: "exists" -> "exist"
p. 13, l. 23: "Our experiments Section 8.4" -> "Our experiments (Section 8.4)" or "Our experiments in Section 8.4"
p. 14, l. 33 "create" -> "created"/"creates"
p. 16, l. 50 "can ingest the entire history (in around 26 hours)" -> mention the dataset

Review #2
By Edgard Marx submitted on 04/Apr/2024
Review Comment:


In this work, the authors propose an approach for scalable management of large RDF archives.
The system is built on top of OSTRICH and a SPARQL engine.
The evaluation is sound, and the paper is reasonably well written.
However, I think it would be better if the authors performed a comparison with existing approaches such as QUIT, which also supports full SPARQL queries.
The absence of such a comparison might make the sentence below misleading:

"This is, to the best of our knowledge, the first openly-available endeavor
that provides full SPARQL querying on RDF archives"

I would recommend the work for publication, provided the authors address my comments below.

#Related Work

- "Such an approach allows for a more efficient version materialization process than conventional delta chains." -> That is true, but you would have redundant storage problems, since each version's aggregated changeset might contain the deltas of previous versions.


##Section 5

- It seems to me that Algorithms 1 and 4 may lead to incorrect results.
In the algorithms, it is shown that there might be multiple additions and deletions in different versions.
That said, consider the following scenario:

A user adds a property p', regrets it, and removes it.
Because the algorithm deals with multiple deletions first and additions later, this could lead
to an inconsistency: p' might appear as a false positive in a recovered version,
unless there is some rule that prevents it, which is not explained in the article.
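The suspected failure mode can be sketched in a few lines. This models the scenario under assumed aggregated-changeset semantics, not the paper's actual algorithms:

```python
# Sketch of the reviewer's scenario: a triple p' is added in one version and
# deleted again in a later one. If materialization applies all aggregated
# deletions first and all additions afterwards, without cancelling matching
# add/delete pairs, p' resurfaces. All names are hypothetical.

snapshot = {"t0"}
additions = {"p'"}   # p' was added in some version i
deletions = {"p'"}   # ...and removed again in a later version

# Suspect order: deletions first, then additions -> p' reappears.
buggy = (snapshot - deletions) | additions

# A consistent scheme must cancel matching add/delete pairs first.
net_add = additions - deletions
net_del = deletions - additions
correct = (snapshot - net_del) | net_add

print("p'" in buggy, "p'" in correct)  # True False
```

If the system guarantees that a triple never occurs in both the aggregated addition and deletion sets (i.e., pairs are cancelled at ingestion time), the issue disappears; stating such an invariant in the article would resolve the concern.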


Although this section contains a comprehensive analysis of the approach in different setups, I think it misses a comparison with existing approaches, e.g., Quit Store.


- Please create a table stating your baseline and the other approaches. This will simplify the reading.
- The first time OSTRICH appears in the paper, it is given without a reference.
- Line 9 of Algorithm 2 is wrong: deletions ($u_j^-$) are represented as additions ($u_j^+$).

Review #3
By Panagiotis Papadakos submitted on 01/May/2024
Major Revision
Review Comment:

This paper is interesting since it discusses an approach for RDF archiving and querying based on multiple snapshots and aggregated deltas. The authors propose a compression scheme for versioning metadata and evaluate their approach along the dimensions of ingestion time, size and querying efficiency.

Generally, the paper is well written. However, I am missing a running example that would really help the reader understand all the details of the approach. For example, the SPARQL example given in Section 7.1 is a good candidate. However, this is not a single-triple-pattern querying example, which leads to my second major comment.

I think the paper would really benefit from a more extensive discussion of the differences between a single-triple-pattern query and the multi-triple-pattern queries used in most SPARQL queries over archives. The results provided in Section 8.5 are interesting; however, they fail to provide any insights into the requirements and peculiarities of such queries in the context of an archiving system. I also miss a discussion of the types of the 11 queries used and how they affect the performance.

Some other comments follow:

- No discussion of the applicability of the proposed approach for CV/CD queries
- Please provide more details about the additional metadata stored in the deletion triple index. The paper should be self-contained, and there should be no need to read external papers (e.g., for OSTRICH). The running example would be really helpful here.
- I am missing a discussion and some numbers regarding the number and sizes of snapshots, along with the sizes of the delta chains, in the evaluation section for the different strategies.
- The authors mention Variable Size Integer Encoding as an extra compression scheme. Is this scheme used? If not, the authors could consider adding numbers for it.
- I did not understand whether querying is done with the whole archive data and indexes in memory, or whether there are disk accesses.
- Maybe it would be better if the experiments were run using SSDs instead of an HDD, which probably amplifies the I/O cost.
- It would be nice to provide a comprehensive table with guidelines about choosing different strategies for different archives.
- Figure 7: something is wrong with the graph in (b).