Towards Fully-fledged Archiving for RDF Datasets

Tracking #: 2538-3752

Authors: 
Olivier Pelgrin
Luis Galárraga
Katja Hose

Responsible editor: 
Guest Editors Web of Data 2020

Submission type: 
Full Paper
Abstract: 
The dynamicity of RDF data has motivated the development of solutions for archiving, i.e., the task of storing and querying previous versions of an RDF dataset. Querying the history of a dataset finds applications in data maintenance and analytics. Notwithstanding the value of RDF archiving, the state of the art in this field is under-developed: (i) most existing systems are neither scalable nor easy to use, (ii) there is no standard way to query RDF archives, and (iii) solutions do not exploit the evolution patterns of real RDF data. On these grounds, this paper surveys the existing works in RDF archiving in order to characterize the gap between the state of the art and a fully-fledged solution. It also provides RDFev, a framework to study the dynamicity of RDF data. We use RDFev to study the evolution of YAGO, DBpedia, and Wikidata, three dynamic and prominent datasets on the Semantic Web. These insights set the ground for the sketch of a fully-fledged archiving solution for RDF data.
Tags: 
Reviewed

Decision/Status: 
Minor Revision

Solicited Reviews:
Review #1
By Andre Valdestilhas submitted on 23/Sep/2020
Suggestion:
Accept
Review Comment:

The paper addresses all my previous comments appropriately. Thus, I recommend its acceptance.

Review #2
By Natanael Arndt submitted on 09/Oct/2020
Suggestion:
Minor Revision
Review Comment:

The presented revised version is very nice to read and includes a lot of new details.

The introduction of the basic notation used in the paper in Table 1 is very good and provides a good overview for the reader. To be pedantic, you can introduce triple and quad next to 5-tuple as well ;-)

As stated in my last review, I still think you can improve the readability if you put the relevant formulas in Section 2, especially 2.2 and 2.3, into separate definition environments.

Regarding Table 4, why does R43ples have no support for multi-graph? In our eyes it does, as mentioned in my last review. You also state in your text that "R43ples can version multiple graphs" (Section 5.1.2, p12 l30).

The bibliography looks much better now.
For the Quit Store please use the journal paper as a reference (https://doi.org/10.1016/j.websem.2018.08.002) instead of the Extended Abstract you are referencing at the moment.

Some pedantic notes:

p13 l20 "(e.g. N-quads files)" -> "(i.e. N-triples files)" (e.g. -> i.e. and we use n-triples).

p18 l42 "R4triples [26]" -> "R43triples [26]"

So there are just very minor changes left.

Review #3
By Pascal Molli submitted on 12/Dec/2020
Suggestion:
Minor Revision
Review Comment:

Dear all,

I replied to the authors by prefixing my answers with a "*". I kept some parts of the authors' answers to contextualize my replies.

The edited parts of the paper have been highlighted in blue for ease of review.

* Thanks. It is very convenient for reviewing.

CP1. We have created a website (mentioned in Section 3) with a link to the Git repository of RDFev (https://gitlab.com/opelgrin/rdfev) and the experimental datasets. The website is at https://relweb.cs.aau.dk/rdfev.

* ok.

CP2. [R2] Although these parts are related, it could also be 3 different papers: a study, a survey, and a vision paper [...] Research directions of section 7 remain blurry. There is a gap between spotted issues and conclusions from experiments.
We have revamped Section 7 by (i) giving more concrete examples of what we mean by a fully-fledged RDF archiving solution, by (ii) providing a more fine-grained discussion of the important design aspects of such a solution, and by (iii) framing those aspects in the light of the results presented in Sections 4-6.

* I reviewed the changes and it is clearly better.

CP3. Quit Store. We have added Quit Store to the list of studied systems in our survey in Section 5. We have also tested it according to our experimental setup (see “CP4. Limitations of the testable systems” for more details).

* Ok. great.

CP4. Limitations of the testable systems. [R2] The root cause of the performance limitation of the only tested system is not established. Performance issues are not sufficiently covered in the paper. I think this is a very important issue for RDF archiving systems.

* Well, Quit Store has been added to the paper, but it cannot handle the setup of the paper. Many interesting points have been added to the paper, but after reading the whole paper, I still do not see clearly whether there are algorithmic challenges to solve to achieve fully-fledged RDF archiving.

To be clear, for many systems we do not know whether the issues come from design flaws, bugs in existing software, or algorithmic limitations (I suppose the authors would argue a little of each). Section 7 is interesting to read, but it looks a little like a Santa's list. For example: "archiving of the local and joint (global) history of multiple RDF graphs. We argue that such a feature is vital for proper RDF archiving". Maybe you are right, but is such archiving sustainable on the algorithmic side? If you have a grow-only system, you will have to cut things to make it sustainable. Even Git offers mechanisms to shorten histories. I suspect that for each nice functionality of an RDF archiving system, there is a price to pay in storage size/query performance to ensure scalability over revisions.

* The word "complexity" appears only once in the paper, for the "ingestion time complexity" "that should depend more on the size of the changeset than the history size". Are there no other complexities that should be independent of the number of revisions or the size of the history?

* The word "scalable" appears just three times in the paper: "We argue that vector and matrix representations may be suitable for scalable RDF archiving as they allow for good compression". I think that scalability touches nearly all aspects of a fully-fledged archiving system and should be connected to my previous remark on complexity.

* Scalability is already a big challenge for knowledge graphs even without archiving; considering archiving makes things much more challenging. I am very concerned about how you keep complexities under control to ensure the scalability of an RDF archiving system. For example, it would be nice to have highly expressive queries over an RDF archive, but if they are not tractable, they will be just useless. Can you point out some challenging scientific problems that need to be solved to make a fully-fledged RDF archiving system a reality?

== Reviewer 2 ==
R2-P1. Example in Figure 2. When applied to G0, I cannot obtain G1. I have the same problem with figure 2.
Thank you for pointing this out; there was indeed a mistake in Figure 2. We have fixed it.

* Great.

R2-P2. Newer Wikidata versions. After this period, Wikidata was ∼2 billion triples, and it is now 8 billion triples. I am not sure that observations made on Wikidata before 2016 can be extended to the recent period. Consequently, the conclusions of Section 4.4 may be impacted if the recent period were included. Overall, spotting from this long experiment only that datasets have major and minor releases is a little disappointing.
We acknowledge that Wikidata's evolution patterns may have changed in recent versions. We chose the available RDF exports in the aforementioned period because they are of manageable sizes and because they were carefully prepared to store the core information that applications are likely to use. The full Wikidata dumps for all versions up to now are, unfortunately, not manageable with our storage resources. Each dump is around 800 GB (https://archive.org/details/wikimediadownloads?and%5B%5D=%22Wikidata%20e..., there are more than 50 since 2018), and generating the RDF exports would require even more disk space. This is why the main goal of Section 4.4 is now stated more clearly: to show that a metric-based framework can help us get insights into the evolution of a given stream of versions. By providing the source code of RDFev, anyone with enough computing resources can now conduct such an analysis. We are currently working on getting more resources in order to study the entire available history of Wikidata.

* I know that these experiments are difficult and require a lot of time and resources. You can write that "Wikidata exhibits a stable release cycle in the studied period", but you know that there was a major release after 2016, simply because they started to contextualize facts. If I imagine the impact of the Wikidata changes after 2016 on Figures 3g, 3h, and 3i, we should see outliers in the last revisions of the Wikidata curves. In any case, if I look at Figure 3 column by column, we can also say there is no similar evolution pattern between datasets. I find that the claim that these experiments show two kinds of evolution patterns cannot really be deduced from the numbers. I agree that systems should adapt to various situations and cannot bet on one evolution pattern.

R2-P3. Ingestion time. [...] It is mainly a complexity problem caused by the fact that the i-th revision of an archive is stored as a changeset of the form u_{0,i}, in other words, Ostrich’s changesets are redundant and can only grow in size. We agree with this remark and we have revamped our analysis in Section 6.2 so that it inquires more into the causes of this phenomenon. Indeed, ingestion time complexity would become prohibitive for long histories.

* Ok. But is this problem specific to Ostrich, or is it a problem of any pure changeset-based solution? Can a changeset-based solution scale without inserting snapshots at the right places in the history (as in many revision control systems such as Mercurial SCM)?
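The trade-off behind this question can be sketched in a few lines. This is a toy model, not Ostrich's actual implementation; the function names and numbers are purely illustrative. It contrasts aggregated changesets of the form u_{0,i} (every revision stores all edits since revision 0) with a delta chain that is periodically reset by a snapshot:

```python
def aggregated_changeset_sizes(edits_per_revision, n_revisions):
    """Each revision i stores u_{0,i}: all edits since revision 0."""
    sizes, total = [], 0
    for _ in range(n_revisions):
        total += edits_per_revision
        sizes.append(total)  # changeset grows with the whole history
    return sizes

def snapshot_chain_sizes(edits_per_revision, n_revisions, snapshot_every):
    """Insert a snapshot every k revisions; deltas restart from it."""
    sizes, since_snapshot = [], 0
    for i in range(n_revisions):
        if i % snapshot_every == 0:
            since_snapshot = 0  # snapshot resets the delta chain
        since_snapshot += edits_per_revision
        sizes.append(since_snapshot)
    return sizes

# 100 revisions, 10 edits each:
agg = aggregated_changeset_sizes(10, 100)
snap = snapshot_chain_sizes(10, 100, snapshot_every=10)
print(agg[-1], snap[-1])  # 1000 vs 100
```

Under this model, the last aggregated changeset holds the entire edit history (1000 edits), while the snapshot-interleaved chain stays bounded by the snapshot interval (100 edits), at the extra storage cost of the snapshots themselves.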

R2-P4. Conclusions from Section 6. More generally, I don’t see, from the experiments of section 5, clear evidence of issues highlighted in the introduction. Ostrich is the only system that works, and finally, it seems that the only problem is related to ingestion time. IMHO, it impacts the storytelling of the paper. Maybe the results of the experiments of section 5 can be used more wisely to highlight issues spotted in the introduction and improves the storytelling of the paper.
Thank you for raising this point. However, we remark that Ostrich also takes very long to run some queries on YAGO. To enrich the findings of Section 6, we have conducted a deeper data-oriented analysis of the factors that impact the query runtime of Ostrich; the number of revisions ranks highest. We have adapted both the introduction and Section 6 accordingly.

* ok

R2-P5. Query language. In the introduction, the lack of a standard for querying was highlighted, but it is not present in Section 7. From the survey section, the only working system only proposed VM, DM, and V queries. Does that mean that a standard should concentrate on these query types?
The revised version of Section 7 discusses this point in detail. We propose SPARQL* (enhanced with additional syntactic sugar) as it can natively express CD and CV queries.

R2-P6. RDF*. In the serialization and querying paragraph of Section 7, RDF* is highlighted. I agree that RDF* makes reification easier, but it is just syntactic, and performance is significantly impacted.
We agree with this remark and we have modified our proposal in Section 7 about the role of RDF* in a fully-fledged archiving solution.

* ok.

To summarize, I think that the first part of the paper (evolution analysis) is fine. Maybe the conclusions of this part are not 100% confirmed by the numbers (from my point of view). The second part on tool benchmarking is fine; I learned that nothing really works, except maybe some commercial tools that are not part of the setup. I think that Section 7 can still be improved by taking a step back and raising more abstract issues than the ones written in the paper (which are not bad but can be improved).