Review Comment:
Dear all,
I answered the authors by prefixing my answers with a "*". I kept some parts of the authors' answers to contextualize mine.
The edited parts of the paper have been highlighted in blue for ease of review.
* Thanks. It is very convenient for reviewing.
CP1. We have created a website (mentioned in Section 3) with a link to the Git repository of RDFev (https://gitlab.com/opelgrin/rdfev) and the experimental datasets. The website is at https://relweb.cs.aau.dk/rdfev.
* ok.
CP2. [R2] Although these parts are related, it could also be 3 different papers: a study, a survey, and a vision paper [...] Research directions of section 7 remain blurry. There is a gap between spotted issues and conclusions from experiments.
We have revamped Section 7 by (i) giving more concrete examples of what we mean by a fully-fledged RDF archiving solution, by (ii) providing a more fine-grained discussion of the important design aspects of such a solution, and by (iii) framing those aspects in the light of the results presented in Sections 4-6.
* I reviewed the changes and it is clearly better.
CP3. Quit Store. We have added Quit Store to the list of studied systems in our survey in Section 5. We have also tested it according to our experimental setup (see “CP4. Limitations of the testable systems” for more details).
* Ok, great.
CP4. Limitations of the testable systems. [R2] The root cause of the performance limitation of the only tested system is not established. Performance issues are not sufficiently covered in the paper. I think this is a very important issue for RDF archiving systems.
* Well, Quit Store has been added to the paper, but it cannot support the experimental setup. Many interesting points have been added, but after reading the whole paper, I still do not see clearly which algorithmic challenges must be solved to achieve full-fledged RDF archiving.
To be clear, for many systems we do not know whether the issues stem from design flaws, bugs in the existing software, or algorithmic limitations (I suppose the authors would argue a little of each). Section 7 is interesting to read, but it looks a little like a Santa's list. For example: "archiving of the local and joint (global) history of multiple RDF graphs. We argue that such a feature is vital for proper RDF archiving". Maybe you are right, but is such archiving sustainable on the algorithmic side? If you have a grow-only system, you will have to cut things to make it sustainable; even Git provides mechanisms to shorten histories. I suspect that for each nice functionality of an RDF archive system, there is a price to pay in storage size and query performance to ensure scalability in the number of revisions.
* The word "complexity" appears only once in the paper, for the "ingestion time complexity" "that should depend more on the size of the changeset than the history size". Are there no other complexities that should be independent of the number of revisions or of the size of the history?
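To illustrate what I have in mind, such desiderata could be stated as bounds that do not depend on the number of revisions n. The formulation below is mine, not the authors'; I only reuse and extend the paper's u_{0,i} notation to denote the changeset u_{i,j} between revisions i and j:

```latex
% Desiderata for an archive with revisions G_0, ..., G_n and
% changesets u_{i,j} (my notation, extending the paper's u_{0,i}):
\mathrm{ingest}(G_i) \in O(|u_{i-1,i}|)  % not O(|u_{0,i}|)
\mathrm{VM}(i)       \in O(|G_i|)        % version materialization
\mathrm{DM}(i,j)     \in O(|u_{i,j}|)    % delta materialization
\mathrm{storage}(n)  \in O\bigl(|G_0| + \textstyle\sum_{i=1}^{n} |u_{i-1,i}|\bigr)
```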
* The word "scalable" appears just three times in the paper: "We argue that vector and matrix representations may be suitable for scalable RDF archiving as they allow for good compression". I think that scalability affects nearly every aspect of a full-fledged archive system and should be connected to my previous remark on complexity.
* Scalability is already a big challenge for knowledge graphs even without archiving; adding archiving makes things much more challenging. I am very concerned about how you keep the various complexities under control to ensure the scalability of an RDF archive system. For example, it would be nice to support highly expressive queries over an RDF archive, but if they are not tractable, they will be useless. Can you point out some challenging scientific problems that need to be solved to make a full-fledged RDF archive system a reality?
== Reviewer 2 ==
R2-P1. Example in Figure 2. When applied to G0, I cannot obtain G1. I have the same problem with figure 2.
Thank you for pointing this out; there was indeed a mistake in Figure 2. We have fixed it.
* Great.
R2-P2. Newer Wikidata versions. After this period, Wikidata was ∼2 billion triples, and it is now 8 billion triples. I'm not sure that observations made on Wikidata before 2016 can be extended to the most recent period. Consequently, the conclusions of Section 4.4 may be affected if the recent period were included. Overall, merely learning from this long experiment that datasets have major and minor releases is a little disappointing.
We acknowledge that Wikidata's evolution patterns may have changed in recent versions. We chose the available RDF exports in the aforementioned period because they are of manageable size and because they were carefully prepared to store the core information that applications are likely to use. The full Wikidata dumps for all versions up to now are, unfortunately, not manageable with our storage resources: each dump is around 800 GB (https://archive.org/details/wikimediadownloads?and%5B%5D=%22Wikidata%20e..., there are more than 50 since 2018), and generating the RDF exports would require even more disk space. This is why the main goal of Section 4.4 is now stated more clearly: to show that a metric-based framework can help us gain insights into the evolution of a given stream of versions. By providing the source code of RDFev, we enable anyone with enough computing resources to conduct such an analysis. We are currently working on obtaining more resources in order to study the entire available history of Wikidata.
* I know that these experiments are difficult and require a lot of time and resources. You can write that "Wikidata exhibits a stable release cycle in the studied period", but you know that there was a major release after 2016, if only because they started to contextualize facts. If I imagine the impact of the post-2016 Wikidata changes on Figures 3g, 3h, and 3i, we should see outliers in the last revisions of the Wikidata curves. In any case, looking at Figure 3 column by column, one could also say that there is no similar evolution pattern between datasets. I find that the claim that these experiments show two kinds of evolution patterns cannot really be deduced from the numbers. I agree that systems should adapt to various situations and cannot bet on a single evolution pattern.
R2-P3. Ingestion time. [...] It is mainly a complexity problem caused by the fact that the i-th revision of an archive is stored as a changeset of the form u_{0,i}; in other words, Ostrich's changesets are redundant and can only grow in size. We agree with this remark and we have revamped our analysis in Section 6.2 so that it inquires more into the causes of this phenomenon. Indeed, the ingestion time complexity would become prohibitive for long histories.
* Ok. But is this problem specific to Ostrich, or is it inherent to any pure changeset-based solution? Can a changeset-based solution scale without inserting snapshots at the right places in the history (as many revision control systems, such as Mercurial SCM, do)?
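To make the question concrete, here is a minimal sketch of the two policies I am contrasting. This is assumed semantics, not Ostrich's actual code; the class names and the cut-every-k policy are mine.

```python
# Minimal sketch, not Ostrich's actual code. Graphs are sets of triples;
# changesets are (added, deleted) pairs of sets.

def apply_delta(graph, added, deleted):
    """Materialize the next revision from a base graph and a changeset."""
    return (graph - deleted) | added

class AggregatedArchive:
    """Ostrich-style storage: revision i is kept as u_{0,i} against the
    initial snapshot, so each stored changeset repeats all earlier edits
    and ingestion work grows with the length of the history."""

    def __init__(self, snapshot):
        self.snapshot = set(snapshot)
        self.changesets = []                 # changesets[i-1] == u_{0,i}

    def ingest(self, added, deleted):
        prev_add, prev_del = (self.changesets[-1]
                              if self.changesets else (set(), set()))
        # Rebuilding u_{0,i} touches the whole aggregated changeset:
        self.changesets.append(((prev_add | added) - deleted,
                                (prev_del | deleted) - added))

    def materialize(self, i):                # VM query: one delta, cheap
        if i == 0:
            return set(self.snapshot)
        added, deleted = self.changesets[i - 1]
        return apply_delta(self.snapshot, added, deleted)

class SnapshotChainArchive:
    """Alternative: store small deltas u_{i-1,i} and cut the chain with a
    fresh snapshot every k revisions, trading storage for bounded work."""

    def __init__(self, snapshot, k=3):
        self.k = k
        self.segments = [(set(snapshot), [])]   # (snapshot, deltas) pairs

    def ingest(self, added, deleted):
        snapshot, deltas = self.segments[-1]
        if len(deltas) < self.k:             # cost O(|changeset|)
            deltas.append((added, deleted))
        else:                                # cut: start a new segment
            base = snapshot
            for a, d in deltas:
                base = apply_delta(base, a, d)
            self.segments.append((apply_delta(base, added, deleted), []))

    def materialize(self, i):                # VM query: at most k deltas
        snapshot, deltas = self.segments[i // (self.k + 1)]
        for a, d in deltas[: i % (self.k + 1)]:
            snapshot = apply_delta(snapshot, a, d)
        return snapshot
```

In the first policy, ingestion cost grows with the history even when the edit itself is small; in the second, ingestion is bounded by the changeset size and version materialization applies at most k deltas, at the price of extra snapshot storage. My question is whether such a price is acceptable at the scale of RDF archives.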
R2-P4. Conclusions from Section 6. More generally, I don't see, from the experiments of section 5, clear evidence of the issues highlighted in the introduction. Ostrich is the only system that works and, finally, it seems that the only problem is related to ingestion time. IMHO, this impacts the storytelling of the paper. Maybe the results of the experiments of section 5 could be used more wisely to highlight the issues spotted in the introduction and improve the storytelling of the paper.
Thank you for raising this point. However, we remark that Ostrich also takes a very long time to run some queries on YAGO. To enrich the findings of Section 6, we have conducted a deeper data-oriented analysis of the factors that impact Ostrich's query runtime; the number of revisions ranks highest. We have adapted both the introduction and Section 6 accordingly.
* ok
R2-P5. Query language. In the introduction, the lack of a standard for querying was highlighted, but this point is absent from section 7. According to the survey section, the only working system proposes only VM, DM, and V queries. Does this mean that a standard should concentrate on these query types?
The revised version of Section 7 discusses this point in detail. We propose SPARQL* (enhanced with additional syntactic sugar) as it can natively express CD and CV queries.
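* For illustration only, I assume something along these lines is meant. The nested-triple (<< ... >>) syntax is SPARQL*, but the versioning vocabulary (v:addedIn) and the URIs are my inventions, not the paper's proposal:

```python
# Hypothetical CV (cross-version) query sketched as a SPARQL* string;
# the v:addedIn annotation property is assumed, not standardized.
CV_QUERY = """
PREFIX v: <http://example.org/versioning#>
SELECT ?city ?rev WHERE {
  << ?country <http://example.org/capital> ?city >> v:addedIn ?rev .
}
"""
```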
R2-P6. RDF*. In the serialization and querying paragraph of section 7, RDF* is highlighted. I agree that RDF* makes reification easier, but it is only syntactic, and performance is significantly impacted.
We agree with this remark and we have modified our proposal in Section 7 about the role of RDF* in a fully-fledged archiving solution.
* ok.
To summarize, I think that the first part of the paper (the evolution analysis) is ok, although its conclusions are perhaps not 100% confirmed by the numbers (from my point of view). The second part, on benchmarking the tools, is ok; I learned that nothing really works, except maybe some commercial tools that are not part of the setup. I think that Section 7 can still be improved by taking a step back and raising more abstract issues than the ones currently written in the paper (which are not bad, but can be improved).