Review Comment:
This paper presents an extensive state of the art on dataset profiling.
The state-of-the-art is impressive. Unfortunately, there are too many problems, and the paper is not yet mature enough for journal publication.
First, the paper lacks a precise framework defining the notions it uses. It does not actually define what a dataset is. This is not rigorous enough, given that catalogues and the main vocabularies in use, such as DCAT, try to make crucial distinctions like dataset vs. distribution. “Descriptive metadata, i.e. profiles” on p1 is also troubling: for many, profiles are not limited to what is called descriptive metadata (e.g., access metadata is not descriptive metadata). In fact, for several communities working with descriptive metadata, the notion of “application profile” could conflict with that of “dataset profile”, which is quite different.
The state-of-the-art is very extensive, and constitutes a very useful resource for would-be readers. This is probably indeed the first time this has been attempted at such a scale! But there are two problems with it:
1. Even though it is extensive, it is incomplete with respect to dataset profiling. The authors focus on gathering references for dataset (quality) analysis. This is good, but important efforts on creating frameworks for expressing profiles, especially by providing (or re-using) vocabularies, are not mentioned. These could have been compared with what the authors propose. DCAT and VoID are referred to, but the analysis is very cursory. More domain-specific efforts have been ignored: for example, the Health Care and Life Science community has researched a dataset profile (http://www.w3.org/2001/sw/hcls/notes/hcls-dataset/). In the geographic domain, the EU initiative GeoDCAT-AP should also be studied. The state-of-the-art on profiling and quality also misses references to relevant ISO standards: domain-specific ones, such as the geographic standards that influence GeoDCAT-AP, or more general ones like the ISO 25000 family (esp. 25012).
2. There are inconsistencies in the way the references are used. First, in section 2 many references are given for different profile features. Most of them are data analysis papers. For a gathering of features, simple references to a vocabulary or inventory (e.g. a dataset catalogue) that exhibits the features would be enough to establish a requirement for the topic. Actually this would be more convincing than a piece of academic work, possibly very technical, which may fall short of giving a practical motivation for what it does (as the focus would be on an algorithm or an experiment). For example, DCAT has properties for representing the domain/topic of a dataset. There is no need to refer to  that extracts such topics. On the other hand, section 3, which gathers methods to extract profile features, doesn’t refer to the papers cited in section 2 that present such methods. This is quite a missed opportunity.
Furthermore, some of the references in section 2 seem superfluous for explaining what specific features are: I am not sure one would need both  and  ( looks more relevant) or both  and  or both  and .
In 2.2 there is a problem with the reference for the classification of quality characteristics, which is presented as coming partly from . This is only an ‘under review’ paper, without URL or publication context, so it’s unclear what the source is. As a matter of fact, the authors mentioned in  have published a very recent paper in this very journal, “Quality Assessment for Linked Data: A Survey: A Systematic Literature Review and Conceptual Framework”, so I’ve used this one. When I did the comparison, I found many differences, even some features that are classified in different dimensions, which I guess would be found in any recent work of the authors of . For example, licensing is in “accessibility” in , whereas it is in Trust in the paper. It is possible that  (and other references like ) has shortcomings. But the choices made here are debatable (I would argue that licensing is orthogonal to data quality). And in any case such deviations from the state of the art should be explained. They are currently not even flagged as such, so it’s very difficult for the reader to guess what is happening in this section. Section 2 actually reads as if the authors propose a new quality framework, which could be interesting but is arguably not what the section had embarked on.
In section 3.2 the authors should make more explicit whether they re-use the matter of  for a subset of the systems there, or whether they extend that matter. If that section contains original material, this should be made more explicit. If not, then it can be considerably shortened.
In section 3.3 it’s unclear whether all the systems mentioned really “extract” temporal characteristics (as written in the title of the section), or if they just manage them:
- Semantic pingback as it is described in the text mostly focuses on cases where a dataset is being re-used in other sources. In principle the pinging doesn’t change the content of the original dataset, and thus doesn’t facilitate consistency and timeliness per se. It’s possible that the publishers would integrate changes based on the pings, but it’s not essential to the general pinging approach.
- Memento is a mechanism to serve different time versions of data. It represents data that can be used to compute temporal characteristics (for example number of versions) but it doesn’t extract them by itself.
In sections 4.2 and 4.3, tools are presented that compute statistics or make assessments based on them. But these tools don’t really motivate the need for statistics to be already expressed in a profile. On the contrary, they compute these statistics or extract features themselves, by querying the data and/or running inferences. It’s not very difficult to extract, say, the number of properties used in a dataset. And it’s more reliable than using already published profile data, which could be outdated; for data assessment tools this is crucial.
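To illustrate how little is needed, here is a minimal sketch in Python. It assumes the dataset is available as an iterable of (subject, predicate, object) triples; the function name `property_usage` and the sample data are hypothetical and are not taken from any of the tools discussed in the paper.

```python
from collections import Counter


def property_usage(triples):
    """Count how often each predicate occurs in an iterable of (s, p, o) triples."""
    return Counter(p for _, p, _ in triples)


# Hypothetical sample data, for illustration only.
SAMPLE = [
    ("ex:a", "rdf:type", "foaf:Person"),
    ("ex:a", "foaf:name", "Alice"),
    ("ex:b", "rdf:type", "foaf:Person"),
    ("ex:b", "foaf:name", "Bob"),
    ("ex:b", "foaf:knows", "ex:a"),
]

usage = property_usage(SAMPLE)
# The number of distinct properties is simply the number of counter keys.
print(len(usage), dict(usage))
```

A tool that recomputes such a figure from the data itself never depends on a published profile being up to date.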
Then, the stats on vocabulary usage analysis in section 5 are very promising, but they don’t look reliable. The data is from early 2015, and it has probably changed one year later. There are some findings that are very surprising, such as Creative Commons being used for only 12 datasets. As the authors themselves write later in the paper, there must be more data out there with CC licenses.
More importantly, it’s uncertain whether table 2 really gives the info the authors claim it gives in the paragraph on p17-18. The text says that LOD2 is used to find info on how many times a vocabulary is used in datasets. But this doesn’t mean that the vocabularies are used in these datasets for dataset profiling (i.e. to describe datasets, e.g. instances of dcat:Dataset). For example, Dublin Core and FOAF can be used to describe many types of resources that are not datasets. If the authors have indeed filtered the LOD2 data to keep only the statements that are about datasets, this should be explained in more detail. Without these details, one will infer that the data is not about datasets, and thus that it’s not very informative for section 5 in general.
Finally, the paper completely falls short on presenting the “RDF vocabulary for unambiguously identifying dataset features” that was promised in the abstract. A link is given to http://data.data-observatory.org/vocabs/profiles/ns, but the elements of this vocabulary are not listed and documented. And there’s no instruction/example of how to use it. Shall it be combined with existing vocabularies? Is it an alternative to all of them, combined?
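The kind of filtering I have in mind could look like the following Python sketch. It is only an assumption about what such a filter might be, not a description of what the authors did: it keeps statements whose subject is typed as a dataset and counts the vocabulary prefix of each predicate. The class set `DATASET_CLASSES`, the function name and the sample triples are all hypothetical.

```python
from collections import Counter

RDF_TYPE = "rdf:type"
# Assumption: these two classes are what counts as "a dataset" for the filter.
DATASET_CLASSES = {"dcat:Dataset", "void:Dataset"}


def vocabularies_describing_datasets(triples):
    """Count vocabulary prefixes of predicates used on subjects typed as datasets."""
    triples = list(triples)
    # First pass: collect every subject explicitly typed as a dataset.
    datasets = {s for s, p, o in triples if p == RDF_TYPE and o in DATASET_CLASSES}
    # Second pass: count predicate prefixes, but only for those subjects.
    counts = Counter()
    for s, p, _ in triples:
        if s in datasets:
            counts[p.split(":", 1)[0]] += 1
    return counts


# Hypothetical sample: Dublin Core describes both a dataset and a document,
# and FOAF describes a person only.
SAMPLE_TRIPLES = [
    ("ex:ds", "rdf:type", "dcat:Dataset"),
    ("ex:ds", "dct:title", "A dataset"),
    ("ex:ds", "dct:license", "ex:cc-by"),
    ("ex:doc", "rdf:type", "foaf:Document"),
    ("ex:doc", "dct:title", "A document"),
    ("ex:alice", "foaf:name", "Alice"),
]

print(dict(vocabularies_describing_datasets(SAMPLE_TRIPLES)))
```

In this sample, only two of the three Dublin Core statements are about a dataset and FOAF is not used for profiling at all, which is exactly the difference a raw per-vocabulary usage count would hide.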
Some more minor comments (only selected bits, as there are just too many to report here):
- p1: “As the Web of Data is constantly evolving, manual assessment of dataset features is neither feasible nor sustainable.” This statement is debatable. Sure, it won’t be possible to profile all datasets manually, but one could argue that providing at least some profile metadata is a feature of good providers.
- p5: the difference between stability of URIs and stability of links is quite unclear, as the only definition given to characterize the stability of links refers to stability of URIs.
- p5: what does “explore the space of a given source, i.e., search and discover data sources of interest.” mean?
- p5: the relation between the references in section 3 and the criteria in section 2 is unclear at times. For “electing the smallest set of relevant predicates representing the dataset in the instance comparison task”, do the predicates correspond to a specific criterion in section 2? (Are they RDF predicates? And why does the paragraph say “we review” when it merely lists the various aspects of the keys discovery approach?)
- p5: footnote 9 is not finished.
- p12:  reads more like state-of-the-art for section 2.2 than a vocabulary for 5.2. Same comment applies to the bullet list in the second column of this page.
- p12: why not mention EDOAL as a reference for alignment vocabulary, next to (or instead of) VoL? Why not mention that VoID also has a part for linksets? Why not mention daQ for data quality? SPIN on the other hand is not for expressing data quality features. Representing rules is quite different from representing the results of applying rules.
- p15: Dublin Core is also used by DCAT and many others for licensing, so it should be in 5.6 (and maybe also other sub-sections as it’s a very general vocabulary)