Scalable Long-term Preservation of Relational Data through SPARQL queries

Tracking #: 467-1647

Authors: 
Silvia Stefanova
Tore Risch

Responsible editor: 
Christoph Schlieder

Submission type: 
Full Paper
Abstract: 
We present an approach for scalable long-term archival of data stored in relational databases (RDBs) as RDF, implemented in the SAQ (Semantic Archive and Query) system. The proposed approach is suitable for archiving scientific data used in scientific publications where it is desirable to preserve only parts of an RDB, e.g. only data about a specific set of experimental artefacts in the database. With the approach, long-term preservation as RDF of selected parts of a database is specified as an archival query in an extended SPARQL dialect, A-SPARQL. The query processing is based on automatically generating an RDF view of the relational database to archive, called the RD-view. A-SPARQL provides flexible selection of the data to be archived in terms of a SPARQL-like query to the RD-view. The result of an archival query is a data archive file containing the RDF triples representing the relational data content to be preserved. The system also generates a schema archive file in which sufficient meta-data are saved to allow the archived database to be fully reconstructed. An archival query usually selects both properties and their values for sets of subjects, which makes the property p in some triple patterns unknown. We call such queries, where properties are unknown, unbound-property queries. To achieve scalable data preservation and recreation, we propose query transformation strategies suitable for optimizing unbound-property queries. These query rewriting strategies were implemented and evaluated in a new benchmark for archival queries called ABench. ABench is defined as a set of typical A-SPARQL queries archiving selected parts of databases generated by the Berlin benchmark data generator. In experiments, the SAQ optimization strategies were evaluated by measuring the performance of the A-SPARQL archival queries in ABench; the performance of equivalent SPARQL queries on related systems was also measured. The results showed that the proposed optimizations substantially improve the query execution time for archival queries.
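
As an illustration of the unbound-property archival queries the abstract describes, here is a minimal A-SPARQL sketch and its SPARQL CONSTRUCT counterpart. The syntax is reconstructed from the review discussion below (a TRIPLES clause with an archived triple pattern and a WHERE archive restriction); the ARCHIVE output clause and the ex: names are hypothetical, not taken from the paper.

    # Hypothetical A-SPARQL: archive all properties of selected products
    ARCHIVE AS 'products.nt'
    TRIPLES ?subject ?property ?value
    WHERE   { ?subject rdf:type ex:Product }

    # Corresponding SPARQL CONSTRUCT over the RD-view; ?property is unbound
    PREFIX ex:  <http://example.org/db#>
    PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
    CONSTRUCT { ?subject ?property ?value }
    WHERE     { ?subject rdf:type ex:Product .
                ?subject ?property ?value }

Because ?property ranges over every column of every table mapped into the RD-view, a naive evaluation unions all sub-views of the RD-view, which is presumably what the proposed rewriting strategies optimize.
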
Tags: 
Reviewed

Decision/Status: 
Minor Revision

Solicited Reviews:
Review #1
By Günther Görz submitted on 17/May/2013
Suggestion:
Accept
Review Comment:

As a non-insider to the long-term preservation of relational
databases, I understood the paper as a contribution to the
discussion that recommends a viable approach, not as the
solution of all problems in the field.

My impression is that the other reviewers were too critical in
this respect, but I agree with their technical recommendations.
The authors' explanations of the detailed technical comments and
recommendations are comprehensible, and I think the paper has been
improved by responding to the imposed constraints.

Therefore, I would agree to the publication of the revised
version.

Review #2
Anonymous submitted on 03/Sep/2013
Suggestion:
Minor Revision
Review Comment:

The authors have addressed almost all of my comments.

The major open issue is the definition of the archival queries, which is not yet well defined. See Comments 2 and 3.

======================================================================
COMMENT 1

Round 1 comment:

b) If the work is focused on relational databases, why aren't CSV dumps
sufficient?

Authors' response:

Answer:
b) The fourth paragraph in the introduction section is sharpened and states
“…it is desirable for the contents of a database to be unloaded in a
neutral format … “ and “…preserved representations must include
sufficient meta-data to retrieve, explain, reproduce, and disseminate …”

Round 2 comment:

Is there any related work that can be provided on using CSV for long-term preservation of data? CSV could also be considered a "neutral format". However, there is no standard way of representing meta-data in CSV, unless the data dictionary of the database is dumped to CSV as well.

Bottom-line: I would like to see a clear case of why RDF is the chosen format versus others.
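
To illustrate the reviewer's point: in a CSV dump the column semantics live outside the file, whereas RDF carries the meta-data with the data. A sketch with hypothetical ex: names (Turtle, prefix declarations omitted):

    id,label,price
    42,Widget,9.95

    ex:Product42 rdf:type   ex:Product ;
                 rdfs:label "Widget" ;
                 ex:price   "9.95"^^xsd:decimal .
    ex:Product   rdf:type   rdfs:Class .

The CSV row is interpretable only together with a separately archived data dictionary, while the triples name their own class and properties.
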

======================================================================
COMMENT 2

Round 1 comment:

c) After the keyword TRIPLES comes an "archived triple pattern". However,
Query A8 has a set of triple patterns. I assume that instead of an "archived
triple pattern", it is a basic graph pattern, which can have 1 or more triple
patterns.

Authors' response:

Answer:
In the new semantic description in Sec 3.1 there is a clear explanation that
in a TRIPLES clause the user specifies “archived triple patterns” and an
optional “archive restriction” in a WHERE clause. An archive restriction
restricts the triples to archive. It consists of a graph pattern and may
include SPARQL functions, which is the case in A8.

Round 2 comment:

This is not yet clear. First of all, 'archived_triple_patterns' is not defined anywhere. I have to look at the examples to understand what it is. I believe that an 'archived_triple_patterns' is a single triple pattern (s, p, o), where s, p, or o can be constant URIs or variables. In the examples I do not see a case where there is more than one triple pattern. If so, the term 'archived_triple_patternS' is misleading (it should not be plural) because it is only one triple pattern.
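
A sketch of the reading the reviewer proposes, with syntax reconstructed from the descriptions in this review (names hypothetical, prefix declarations omitted): exactly one archived triple pattern per TRIPLES clause, while the WHERE archive restriction may be a full graph pattern.

    TRIPLES ?s ?p ?o                        # one archived triple pattern
    WHERE   { ?s rdf:type ex:Offer .        # archive restriction: a graph
              ?s ex:validTo ?date .         # pattern with several triple
              FILTER (?date > "2013-01-01"^^xsd:date) }  # patterns, as in A8

Under this reading, 'archived_triple_pattern' (singular) would be the accurate term.
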

======================================================================
COMMENT 3

Round 1 comment:

c) I would recommend formally presenting the semantics, either by 1) using
rules/Datalog syntax to represent the translation or 2) defining its own
semantics following the approach of the semantics of SPARQL by Pérez et al.
(Jorge Pérez, Marcelo Arenas, and Claudio Gutierrez. 2009. Semantics and
complexity of SPARQL. ACM Trans. Database Syst.) and comparing the
expressivity of A-SPARQL with SPARQL CONSTRUCT. This way, there would be no
room for ambiguity.

Authors' response:

Answer:
The translation rules are now simplified and much better explained.

Round 2 comment:

The translation rules from A-SPARQL to the generated SPARQL are not formalized and consist of two sentences, which are hard to follow.

For example, the first states: " 1) The CONSTRUCT clause of the translated SPARQL query consists of all the archived triple patterns in the TRIPLES clauses of the archive specifications."

If we look at query A2, there are two TRIPLES clauses, and each one has the following triple pattern: ?subject ?property ?value. If I follow the translation, the CONSTRUCT clause would have this triple pattern twice. However, in the CONSTRUCT query Q2, this is not the case (and obviously not what is expected). This is an example of the ambiguity that arises from the lack of formality.

In 1), you make reference to "TRIPLES", but in 2) you make reference to archive specification, archive triple patterns and optional archive restrictions. Please be consistent.
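
The ambiguity can be made concrete. Assuming A2 has the shape sketched below (its actual text is not reproduced in this review, so the reconstruction is illustrative only; ex: names are hypothetical), a literal application of rule 1) duplicates the triple pattern in the CONSTRUCT template:

    # Hypothetical A2: two TRIPLES clauses with identical archived patterns
    TRIPLES ?subject ?property ?value
    WHERE   { ?subject rdf:type ex:Product }
    TRIPLES ?subject ?property ?value
    WHERE   { ?subject rdf:type ex:Offer }

    # Literal reading of rule 1):
    CONSTRUCT { ?subject ?property ?value .
                ?subject ?property ?value }   # duplicated, unlike Q2
    WHERE     { ... }

A formal rule would also have to state whether the two archive restrictions are combined by UNION or by conjunction in the generated WHERE clause.
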

======================================================================
COMMENT 4

Round 1 comment:

2) Following my question (1), the sub-views, even though not defined, seem
very similar to datalog rules of the W3C Direct Mapping (Appendix B) and
Sequeda et al's Augmented Direct Mapping. What is the relationship?

Authors' response:

Answer:
It is now clearly stated in section 5.1 that “The RDB to RDF mapping in
SAQ conforms to the direct mapping recommended by W3C [23], and more
particularly to the augmented direct mapping proposed in [19], which is
proven to guarantee information preservation.”

Round 2 comment:

If this is the case, then I don't see the novelty of section 5.1. I understand that it is needed for the system, and that it is on top of the RD-view that the unbound-property queries are executed and optimized (which is novel and a contribution of the work). I understand this as restating [23] and [19] in a different syntax.

Maybe add a clarification that this is not a contribution in itself but is needed in order to understand how the system is built?
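
For orientation, the direct mapping that section 5.1 restates can be summarized with a small example following the W3C Direct Mapping's relative-IRI conventions; the table and row here are hypothetical. Given a table PRODUCT with primary key ID and a row (42, 'Widget'), the mapping yields triples such as:

    <PRODUCT/ID=42>  rdf:type         <PRODUCT> .
    <PRODUCT/ID=42>  <PRODUCT#LABEL>  "Widget" .

Section 5.1 of the paper would then be read as defining the RD-view that produces such triples, rather than as a new mapping.
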

======================================================================
COMMENT 5

Round 1 comment:

6) The benchmark queries sometimes include the triple: ?class rdf:type
rdfs:Class. I believe that this triple is needed for SAQ because it accesses
its mapping table, which maps schema elements to RDFS elements. Did the
benchmark queries to D2RQ and Virtuoso include that triple? If so, this may
be a cause for the slow performance. If this is the case, what happens when
that triple is not included? What happens to SAQ?

Authors' response:

Answer:
In Sec. 6.2 the following paragraph is added: “Since both D2RQ and Virtuoso
don’t generate for their default mapping a triple with the form (subject
rdf:type rdfs:Class), this triple was excluded from the definitions of
queries Q2, Q5 and Q6 for these systems.”

Round 2 comment:

This answer addresses only one part of my question. The other part is not answered: what happens in SAQ if <> rdf:type rdfs:Class queries are not included? And why do some queries have that triple pattern (A2, A5, …) while others don't (A8, …)?
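
For clarity, the two query shapes in question (reconstructed, since the queries themselves are not reproduced in this review; the ex: name is hypothetical):

    # With the schema triple, as in A2, A5, ...:
    ?class   rdf:type  rdfs:Class .
    ?subject rdf:type  ?class .
    ?subject ?property ?value .

    # Without it, as in A8, ...:
    ?subject rdf:type  ex:Product .
    ?subject ?property ?value .

The open question is how SAQ behaves on the second shape, and why the benchmark mixes both.
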

======================================================================
COMMENT 6

Round 1 comment:

- Why is [22] cited when making reference to Datalog? I would suggest to cite
instead the Foundations of Databases book by Abiteboul, Hull and Vianu.

Authors' response:

Answer:
We actually use a Datalog dialect different from the above, which is the
reason for the reference.

Round 2 comment:

Which dialect of Datalog? And why that one? Please be specific.

Review #3
By Christoph Schlieder submitted on 09/Sep/2013
Suggestion:
Accept
Review Comment:

In my review of the original version of the article, I raised a number of issues from the perspective of digital preservation research which the authors have resolved in their revised version.

The first group of issues relates to the idea of data selection in preservation. More specifically, the following points remained unclear: (1) Why would archiving the RDB2RDF mapping not solve the preservation problem? (2) What would be a plausible application scenario for the selective preservation of relational data? (3) Could the approach solve problems which are known to involve data selection, such as geodata archiving?

The authors now explain why for scientific data a selection step is often necessary before preserving relational data. In this context, the reference to the Scientific Publication Packages described by Hunter (2006) is especially useful. The authors have added a scenario description – data selection as part of the scientific research process prior to publication – which is consistent with their approach. They also made an effort to link the Berlin benchmark product data example to the scenario.

A second group of issues concerned the preservation workflow: (4) What are the connections to the OAIS reference model? (5) How does the approach compare to related work not based on RDF?

The revised article describes the approach as part of the ingest process of the OAIS reference model. In the expanded related work section, the discussion has been broadened and includes approaches not based on RDF.

In summary, the article has considerably gained in precision and relevance for readers with a background in digital preservation. I therefore recommend publishing it.