Evaluating Query and Storage Strategies for RDF Archives

Tracking #: 1814-3027

Javier D. Fernandez
Juergen Umbrich
Axel Polleres
Magnus Knuth

Responsible editor: 
Guest Editors Benchmarking Linked Data 2017

Submission type: 
Full Paper
Abstract: 
There is an emerging demand for efficiently archiving and (temporally) querying different versions of evolving Semantic Web data. As novel archiving systems begin to address this challenge, foundations and standards for benchmarking RDF archives are needed to evaluate their storage space efficiency and the performance of different retrieval operations. To this end, we provide theoretical foundations for the design of data and queries to evaluate emerging RDF archiving systems. We then instantiate these foundations with a concrete set of queries on the basis of real-world evolving datasets. Finally, we perform an extensive empirical evaluation of current archiving techniques and querying strategies, which is meant to serve as a baseline for future developments in querying archives of evolving RDF data.

Solicited Reviews:
Review #1
By Nick Bassiliades submitted on 01/Feb/2018
Review Comment:

Authors have successfully handled and responded to all my comments. The text has improved quite a lot, so I suggest to be accepted.

Review #2
Anonymous submitted on 25/Feb/2018
Review Comment:

The authors addressed the comments in the previous review and the paper is now ready for publication.

Review #3
By David Corsar submitted on 23/Mar/2018
Minor Revision
Review Comment:

This paper sets out a benchmark for evaluating the performance (in terms of space requirements and query retrieval times) of RDF archives. As discussed by previous reviewers, the majority of the first few pages have been previously published in a paper of the same name at SEMANTICS 2016. This includes a review of six types of retrieval queries (version materialisation, single-version structured queries, cross-version structured queries, delta materialisation, single-delta structured queries, and cross-delta structured queries), a discussion of approaches to RDF archiving (independent copies (IC), change-based (CB), timestamp-based (TB), and hybrid-based (HB) approaches), and a formalisation of the features that characterise data and of five types of queries proposed for evaluating RDF archives. Definitions of how these queries can be instantiated using AnQL are provided for three RDF archive approaches (IC, CB, TB). Three datasets, along with their own sets of queries used in the evaluation, are described (BEAR-A from previous work, and the new BEAR-B and BEAR-C datasets).

The authors present an extensive evaluation, which compares their own implementations in Jena and HDT of IC, CB, TB, and three hybrid approaches along with three systems developed by others (v-RDFCSA, R43ples, and TailR). The first part of the evaluation focuses on comparing and explaining differences in the space requirements of the implementations / archive approaches for BEAR-A, three versions of BEAR-B, and BEAR-C. The remainder of the evaluation focuses on comparing and discussing retrieval times for the five queries (with additional results presented in appendices A and B) using BEAR-A and two versions of the BEAR-B dataset and queries. Here the discussion usefully interprets the various graphs (which understandably can be difficult to read given the quantity of data points), discussing some of the strengths and weaknesses of the different archiving approaches in the implementations. BEAR-C is not used in the second part of the evaluation as the systems are unable to resolve the queries; rather, the queries are provided to support future research in the area.

One minor query relates to how the authors envisage that others, for example, developers of a new RDF archiving system, could reuse the BEAR framework. Having looked at the BEAR webpage and sources, it appears there are some scripts under development to run the queries. It’s a minor point that shouldn’t prevent the paper being accepted, but some more documentation / guidance on this would be beneficial to the community.

The conclusion section simply summarises the paper and briefly mentions two future work activities. Ideally this would be improved to give the reader the key points the authors have identified from the evaluation, which should both guide people considering deploying an RDF archiving system and shape future developments in this area.

In terms of originality, the core material related to the actual benchmark has been published previously; the instantiation of the five queries in AnQL is new, as are the BEAR-B and BEAR-C datasets; the main original content is the extended evaluation and the associated discussion. These should provide sufficient new contributions that are useful to researchers in this and related fields, and are of relevance to the special issue. The paper is generally well written; there are a few typos (see below), and links to higher-resolution versions of the graphs would improve their legibility.

Pg 7, Left col: “especial” -> “special”
Pg 9, Right col: “end end” -> “end”
Pg 10, left col: “ckecked” -> “checked”
Pg 17, Left col: “scalability problems at large scale RDF” -> “scalability problems. At large scale specific”
Pg 21, right col: “trough” -> “through”