Review Comment:
The paper "Towards Fully-fledged Archiving for RDF Datasets" consists of an introduction (section 1), preliminaries (section 2), two major parts (sections 3-4 and 5-7) and a short conclusion.
The two major parts are a "Framework for the Evolution [of] RDF Data" based on the metrics proposed by Fernández et al. in [18], which is presented in section 3 and applied to examples in section 4; and a "Survey of RDF Archiving Solutions" in section 5, which is discussed in sections 6 and 7.
From the title "Towards Fully-fledged Archiving for RDF Datasets" I expect a thorough requirements analysis and conceptual specification of an RDF Archiving system with a classification of the related work.
The four points listed in the abstract
(i) existing systems are neither scalable nor easy to use,
(ii) no solution supports multi-graph RDF datasets,
(iii) there is no standard way to query RDF archives, and
(iv) solutions do not exploit the evolution patterns of real RDF data.
support these expectations.
While reading the paper I see contributions towards (ii)-(iv), but I am missing a definition of "scalable" and "easy to use". Claim (ii) is not true: in Table 2 you list Dydra with multi-graph support; according to our research [ANR+18], R43ples also supports multiple graphs (maybe not full RDF datasets, but still "multi-graph" support); and the Quit Store [ANR+18] supports real RDF datasets.
I was very surprised that the Quit Store [ANR+18] is not in the list of related work, as it provides answers to some of the raised questions. Compared to R43ples and R&Wbase, it has already been evaluated regarding its scalability, it comes with a user interface that guides the user through the versioning system, it supports multi-graph RDF datasets, and it provides a standard SPARQL 1.1 Query & Update interface with full BGP support and a virtual endpoint for each version in the history. Further, it allows VM, DM, and V queries (DM and V using the provenance endpoint), it supports branches and tags, it allows concurrent updates (see also [AR19]), and it is open source (GPLv3, https://github.com/AKSW/QuitStore, also available as a Docker image). Maybe you should also include Stardog in your comparison, even though its source code is not available, as is the case for many of the systems (which, I agree, is a huge problem for research!).
Regarding the selection of systems for the performance evaluation, you state: "In addition to the limitations in functionality, Table 2 shows that most of the existing systems are not easily available, either because their source code is not directly usable, or because they do not easily compile in modern platforms." I admit that it is not easy to get the research prototypes of other teams running, but I would expect some more effort to get the systems running for a proper comparison. In the course of performing the evaluations for the Quit Store [ANR+18] we have made Docker images available for the R43ples and R&Wbase systems (https://github.com/AKSW/r43ples-docker, https://github.com/AKSW/rawbase-docker). We had to ask the respective teams for some support, but in the end it worked. You are welcome to re-use these images. Further, you should state exactly which limitations of R43ples made it impossible to run it on your experimental datasets.
In section 2.1 you define the label g ∈ I. But the RDF standard says "Each named graph is a pair consisting of an IRI or a blank node (the graph name), and an RDF graph." (https://www.w3.org/TR/2014/REC-rdf11-concepts-20140225/#section-dataset). Please also check your section 2.4 in this regard.
Please make your definitions compatible with the RDF standard here.
More generally, this raises the question: how do you deal with blank nodes?
In section 2.2 p2 l28 right you define with "∆^+_i ∪ ∆^-_i != ∅" that the changeset is not allowed to be empty. Why not allow empty changesets? Is there any problem with allowing empty changesets? That could be an option to make your notation of dataset changesets easier. You could use the same revision numbers for the dataset as for each graph and you would not need to deal with the drift of the two indices.
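To illustrate this suggestion, here is a minimal Python sketch under my own assumptions; it is not the paper's notation. A changeset is a pair of added and deleted triple sets that may both be empty, and one dataset revision applies one (possibly empty) changeset to every graph. All names (`Changeset`, `apply_dataset_revision`) are hypothetical.

```python
from dataclasses import dataclass

# A triple as plain strings: (subject, predicate, object).
Triple = tuple[str, str, str]


@dataclass(frozen=True)
class Changeset:
    """A pair <added, deleted>; in this sketch both sets may be empty."""
    added: frozenset[Triple] = frozenset()
    deleted: frozenset[Triple] = frozenset()

    def is_empty(self) -> bool:
        return not self.added and not self.deleted

    def apply(self, graph: frozenset[Triple]) -> frozenset[Triple]:
        # Remove the deleted triples first, then add the new ones.
        return (graph - self.deleted) | self.added


def apply_dataset_revision(dataset, changes):
    """Advance every graph by one revision.

    Graphs without an entry in `changes` receive an empty changeset,
    so all graphs share the revision number of the dataset.
    """
    empty = Changeset()
    names = dataset.keys() | changes.keys()
    return {
        name: changes.get(name, empty).apply(dataset.get(name, frozenset()))
        for name in names
    }


t1 = ("ex:a", "ex:p", "ex:b")
t2 = ("ex:a", "ex:p", "ex:c")
d0 = {"ex:g1": frozenset({t1}), "ex:g2": frozenset()}
d1 = apply_dataset_revision(d0, {"ex:g1": Changeset(added=frozenset({t2}))})
```

Here both graphs move from revision 0 to revision 1 together although only `ex:g1` changed, so a single revision index suffices for the dataset and for each of its graphs.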
Further, in section 2.2, you define the notion of a changeset as follows: "We extend the notion of changesets to arbitrary pairs of revisions i, j with i < j, and use the notation u_ij = <∆^+_{i,j} , ∆^-_{i,j}>." As I understand it, i and j are the numbers of the revisions of a graph. In section 2.3 you transfer your concept from graphs to datasets. You then use several notations to address a graph in a dataset. In one place you use û_j and in another place u^1_1, but these notations are not properly introduced. I would expect a concise definition of a changeset for datasets.
In section 2.3 you write "In contrast to an RDF graph archive, an RDF dataset is a set D = {G^1, G^2, …, G^m} of named graphs where each graph has a label g^k ∈ I. Differently from revisions in a graph archive, we use the notation G^k for the k-th graph in a dataset, whereas G^k_i denotes the i-th revision of G^k ."
Please see my remarks on section 2.1 with regard to blank nodes here: according to RDF 1.1 Concepts and Abstract Syntax, g^k ∈ I∪B. Also, per the definition in RDF 1.1 Concepts and Abstract Syntax, a dataset contains exactly one default graph. How do you deal with the default graph?
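As a rough illustration of the structure I have in mind, the following Python sketch models a dataset with exactly one default graph and named graphs whose names are either IRIs or blank nodes. The class and method names (`IRI`, `BlankNode`, `Dataset`, `graph`) are my own and purely hypothetical; this is a simplified reading of the standard, not a complete model.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Set, Tuple, Union

# A triple as plain strings: (subject, predicate, object).
Triple = Tuple[str, str, str]


@dataclass(frozen=True)
class IRI:
    value: str


@dataclass(frozen=True)
class BlankNode:
    label: str  # local label, only meaningful inside this dataset


# A graph name is an IRI or a blank node, i.e. an element of I ∪ B.
GraphName = Union[IRI, BlankNode]


@dataclass
class Dataset:
    # Exactly one default graph, which has no name.
    default_graph: Set[Triple] = field(default_factory=set)
    # Zero or more named graphs, keyed by their graph name.
    named_graphs: Dict[GraphName, Set[Triple]] = field(default_factory=dict)

    def graph(self, name: Optional[GraphName] = None) -> Set[Triple]:
        """Return the default graph if no name is given, else the named graph."""
        if name is None:
            return self.default_graph
        return self.named_graphs.setdefault(name, set())


ds = Dataset()
ds.graph().add(("ex:s", "ex:p", "ex:o"))
ds.graph(IRI("http://example.org/g1")).add(("ex:s", "ex:p", "ex:o1"))
ds.graph(BlankNode("b0")).add(("ex:s", "ex:p", "ex:o2"))
```

A formalization of dataset revisions would then have to say how the default graph and blank-node-named graphs are versioned, which is what I am asking for.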
Please put the relevant formulas in section 2, especially in 2.2 and 2.3, in separate definition environments and try to illustrate them with simple conceptual figures or examples. In the current state the definitions are intermingled with examples, which makes them hard to read. Maybe you can also take a look at the formalization for expressing changes presented in [ANR+18], sections 6 and 7. This formalization might not be perfect, but it could give you some ideas about what I mean, and maybe you can even extend the model according to your needs.
In section 3 on page 5 you provide a link to your source code, but Dropbox is not a proper archive. Could you please provide your source code in a proper software source code archive? Otherwise you contribute further to the problem that scientific prototypes are hard to reproduce.
I like the idea of ρ, ζ, l(), and rv(), but the notation is not used consistently. On p10 l1 you write and i = rv(ρ). On p11 l42 you write , why don't you consistently write here? Is there a semantic difference? The same applies in l51.
In general I like how you include in your research the categorization of query types and archiving policies introduced by Fernández et al. in [18] and [17], respectively. (In this regard you sometimes refer to [18] and sometimes to [17]; that should be consistent.) But you should point out more clearly which new contributions you make to this model. I also suggest adding the archiving policy "Fragment-based (FB)" [ANR+18] for systems that take snapshots of fragments of a dataset and thus are neither Independent Copies (IC) nor Change-based (CB) approaches.
Regarding your citation style I have some remarks. You often use references as nouns in your sentences, e.g. "the approach presented in [13] relies …". In my eyes "[13]" is not a word, and it makes such sentences hard to read. It would be better to use phrases like "the approach presented by Dong-hyukim et al. [13] relies …" or to name the respective systems. I can't remember all of the numbers while reading the text.
For reference [50] there is some problem with the encoding "… WWW âĂŹ19, page 961âĂŞ965, …". Also for [8].
Could you also please add the DOIs for your references?
For W3C standards, could you please refer to the actual RDF standard documents, e.g. RDF 1.1 Concepts and Abstract Syntax (https://www.w3.org/TR/rdf11-concepts/), instead of the non-normative W3C Working Group Note (RDF 1.1 Primer)?
It is also good practice to reference an exact version: e.g. for [42] (Yves Raimond and Guus Schreiber. RDF 1.1 semantics. W3C recommendation, 2014. http://www.w3.org/TR/rdf11-mt/) use http://www.w3.org/TR/2014/REC-rdf11-mt-20140225/ instead.
In summary, I would expect a conceptual model or specification of a "fully-fledged solution". The various aspects and features that the authors expect from a fully-fledged archiving system are distributed across the paper. So this should be a matter of bringing them all together in a concise definition, which might involve some major re-work of the paper.
The overarching story that integrates the two major parts of the paper, "Framework for the Evolution [of] RDF Data" and "Survey of RDF Archiving Solutions", needs to be clear; currently they seem to be two different topics that should be part of two individual papers. Also, it is not clear to me whether this is a "Full Paper" or a "Survey Article".
The comparison lacks a state-of-the-art system.
There are some problems in the presentation (citations, formulas, and missing examples).
As a result, I like the idea of the paper and would really like to see it published, but in its current state it needs a MAJOR REVISION clarifying all of the mentioned aspects.
[ANR+18] Decentralized Collaborative Knowledge Management using Git by Natanael Arndt, Patrick Naumann, Norman Radtke, Michael Martin, and Edgard Marx in Journal of Web Semantics, 2018. https://doi.org/10.1016/j.websem.2018.08.002; https://natanael.arndt.xyz/bib/arndt-n-2018--jws
[AR19] Conflict Detection, Avoidance, and Resolution in a Non-Linear RDF Version Control System: The Quit Editor Interface Concurrency Control by Natanael Arndt, and Norman Radtke in Companion Proceedings of the 2019 World Wide Web Conference (WWW '19 Companion), 5th Workshop on Managing the Evolution and Preservation of the Data Web, San Francisco, CA, USA, 2019. https://dx.doi.org/10.1145/3308560.3316519; https://natanael.arndt.xyz/bib/arndt-n-2019--qeicc
Some pedantic notes:
p3 l22 right: "(variables are always prefixed with the symbol ?)". This is not correct; the prefix can also be "$".
p4 l27 right: Framework for the Evolution [of] RDF Data
p11 l38,39,42,48 left, l22 right, and Table 2: Dryda -> Dydra (please go through the document with find and replace)
p5 l12-l13 right: "The grow ratio is the ratio between [between] the number of triples in two revisions i, j."
p6 l26-28 left: "These are subsets of triples of the same nature, e.g., triples with literal objects extracted with certain extraction method[s]."
p10 l40 right: "R&WBase has inspired the design of R43ples [24]. Unlike its predecessor, [24] …" Does inspiring something make the one the predecessor of the other?
p14 l40 left "that go beyond the [?], treat …" In place of [?] there is something missing in the sentence.
p14 l31 right: vocabular[it]y dynamicity
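Regarding the note on p3 l22 above: in SPARQL 1.1, "?" and "$" are both valid variable prefixes, and ?abc and $abc identify the same variable. A minimal Python sketch of this, assuming ASCII-only variable names and a naive regular expression that ignores string literals and comments (the names `VAR_PATTERN` and `variable_names` are my own):

```python
import re

# Simplified pattern: "?" or "$" followed by an ASCII variable name.
# The real SPARQL grammar also allows many non-ASCII characters.
VAR_PATTERN = re.compile(r"[?$]([A-Za-z0-9_]+)")


def variable_names(query: str) -> set:
    """Collect variable names without their prefix symbol.

    Naive: this also matches inside string literals and comments.
    """
    return set(VAR_PATTERN.findall(query))


names = variable_names("SELECT ?s $p WHERE { ?s $p ?o }")
```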