Review Comment:
The article describes an in-depth empirical study of various quality indicators for Linked Open Data, based on a large-scale collection of datasets. Furthermore, the authors suggest the use of PCA to judge the utility and redundancy of the individual quality metrics.
In general, the topic is interesting and useful, but the article has a few flaws that should be addressed before publication.
First of all, the mathematical notation often follows rather unusual conventions, at least to me. Although it is possible in most cases to grasp the idea of the formulas, in particular when studying the accompanying text, the formulas should be exact in themselves, not an approximation of what is meant.
The use of PCA is interesting, but the corresponding section is only partly informative. My feeling is that a simple correlation analysis of the metrics may yield the same information, i.e., an understanding of which metrics are redundant. Moreover, I miss some conclusions here, e.g., stating that for two highly correlated metrics it may be enough to compute only one (ideally the one with the least computational effort).
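To illustrate what I mean, a minimal sketch of such a correlation screening (purely illustrative; the file and column names are hypothetical placeholders for the per-dataset metric scores):

    import pandas as pd

    # Hypothetical input: one row per dataset, one column per quality metric score
    # (the file name and column layout are placeholders, not the authors' data).
    scores = pd.read_csv("metric_scores.csv")

    # Pairwise Pearson correlations between the metrics.
    corr = scores.corr()

    # Metric pairs whose absolute correlation exceeds a chosen threshold;
    # for each such pair it may suffice to compute only the cheaper metric.
    threshold = 0.9
    redundant = [(a, b, corr.loc[a, b])
                 for i, a in enumerate(corr.columns)
                 for b in corr.columns[i + 1:]
                 if abs(corr.loc[a, b]) >= threshold]
    print(redundant)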
With regard to related work, there are a few more works in the area that could be mentioned. Both Zaveri et al. (52) and Schmachtenberg et al. (46) are used later, but not listed as related work. More related work can be found, e.g., by looking at the PROFILES workshop series [1]. Furthermore, it might be worthwhile looking at a similar analysis we conducted for schema.org [6], as we also used some similar metrics, such as CS3.
Some of the metrics discussed are arguable. Examples include:
* RC1: The authors suggest that there is a consensus that shorter URIs are better. However, there are also counter-arguments, e.g., for self-unfolding URIs [3].
* RC2: The authors themselves state an example where ordering is essential (i.e., authors of a publication). Hence, RC2 would "punish" datasets in domains where such cases exist.
* IO1: This metric is about proprietary vocabularies, and for its computation, whether a vocabulary is registered in LOV is used as a proxy for the vocabulary being non-proprietary. From my understanding, a better proxy would be to analyze whether the vocabulary is also used by other datasets.
* U1 computes the set of subject URIs (why only subjects, not objects as well?) which have a label. This may not be a good metric in some cases. For example, in a Linked Data dataset containing a data cube [4], it would not make sense to equip every single observation with a label. A similar argument holds for intermediate concepts introduced to express n-ary relations, e.g., CareerStations [5] in DBpedia.
* CS6: In my opinion, it would make more sense to use the overall size of the vocabulary as the denominator.
* CS9: If inference is used here, then each resource is of type owl:Thing. In that case, the metric would always yield 1. If inference is not used, on the other hand, the metric would produce some false negatives.
* The accessibility metrics seem to be measured at a single point in time only. However, some endpoints are occasionally offline for some time, planned or unplanned, and come back online later. The experiment should be designed in a way that measures accessibility repeatedly.
* A3 mixes the dereferenceability of local and non-local URIs. This is a bit problematic, since the dereferenceability of a local URI lies within the responsibility of the dataset provider, while the dereferenceability of a non-local URI does not. In an extreme case, this metric could be optimized by omitting all dataset interlinks in order to avoid being punished for a linked dataset being inaccessible. I suggest handling local and non-local URIs differently (a minimal sketch of such a split is given after this list).
* A3 reports blacklisting as a possible reason for failed connections. This might be a methodological problem, as some endpoints may enforce an upper limit on the number of requests within a given amount of time. The same holds for PE2 and PE3.
* I1 measures the fraction of resources that are linked to a dereferenceable URI. First of all, this mixes linkage and dereferenceability (where the latter is covered by A3, see my remarks on that metric above). Second, this may punish very specific datasets which contain many concepts that simply do not have a counterpart in any other dataset (consider, again, a massive data cube with many observations: what should those observations be linked to?).
* For PE3, it seems odd that a server needs to respond to only one query below 1s. It should be all, or at least the majority, or 95% (see the latency sketch after this list). Furthermore, the metric counts *all* responses. This means that a server which blacklists clients after 5 requests issued within a minute could require 20s to answer a regular request, but deliver the blacklist response really fast, and still pass the metric with a full score.
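Regarding A3: a minimal sketch of the local/non-local split I have in mind, assuming the dereferenceability check itself is the one already used for A3 (the authority-based locality test is a simplification; matching on pay-level domains would be more precise):

    from urllib.parse import urlparse

    def split_dereferenceability(uris, dataset_authority, is_dereferenceable):
        """Report separate dereferenceability ratios for local and non-local URIs.

        `is_dereferenceable` is assumed to be the HTTP check already used for A3;
        `dataset_authority` is the host of the dataset's own namespace,
        e.g. 'dbpedia.org'.
        """
        local = [u for u in uris if urlparse(u).netloc == dataset_authority]
        non_local = [u for u in uris if urlparse(u).netloc != dataset_authority]

        def ratio(group):
            if not group:
                return None
            return sum(1 for u in group if is_dereferenceable(u)) / len(group)

        # Two separate values instead of a single mixed A3 score.
        return {"local": ratio(local), "non_local": ratio(non_local)}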
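Regarding PE3 (and, in the same spirit, the accessibility metrics measured only once): a minimal sketch of a repeated measurement with a percentile-based criterion instead of the single fastest response; the endpoint URL, request count and thresholds are placeholders:

    import statistics
    import time
    import urllib.request

    def latency_percentile(url, n_requests=20, pause=5.0, timeout=30.0, percentile=95):
        """Issue several requests, spaced out to reduce the risk of rate limiting,
        and return the requested latency percentile in seconds (None if all fail)."""
        latencies = []
        for _ in range(n_requests):
            start = time.perf_counter()
            try:
                with urllib.request.urlopen(url, timeout=timeout):
                    latencies.append(time.perf_counter() - start)
            except Exception:
                pass  # failed requests simply contribute no latency value
            time.sleep(pause)
        if len(latencies) < 2:
            return latencies[0] if latencies else None
        return statistics.quantiles(latencies, n=100)[percentile - 1]

    # Example criterion: pass only if the 95th-percentile latency is below 1 second.
    # p95 = latency_percentile("http://example.org/sparql?query=ASK%20%7B%7D")
    # passed = p95 is not None and p95 < 1.0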
Some more detailed remarks:
* On p.2, it is stated that "LOD datasets [...] are more prone to link spamming than on the Web". I doubt this (what would be the incentive for link spamming on LOD datasets?), and some evidence should be provided for that statement.
* On p.4, in the list of accepted formats, I miss JSON-LD as a possible format.
* p.4: "Another initial experiment was performed on the LOD Cloud snapshot to check how many datasets provide some kind of machine readable license..." - some details would be appreciated. Furthermore, it would be interesting to compare the results to Schmachtenberg et al., who conducted a similar analysis.
* p.6: It is stated that CC Attribution-NonCommercial-ShareAlike and Project Gutenberg are non-conformant licenses for LOD. An explanation would be appreciated.
* p.6: The authors state that according to SPARQLES, the number of reliable SPARQL endpoints has decreased from 242 to 181; hence, they state that 12% of the endpoints became less reliable. First, the resulting number according to my calculation is 11%, not 12%. Second, from what I understand, this is not a valid conclusion: the actual number may even be higher, since the change may also be an effect of some SPARQL endpoints no longer being maintained at all, and new SPARQL endpoints being added to the list.
* In section 4.1, an automatic identification of SPARQL endpoints is proposed. A similar analysis is proposed in [2], which should be referenced and compared to.
* Paragraph 4.2 is a largely redundant summary of the Venn diagram depicted in Fig.4. It should be replaced by a more thorough discussion of the conclusions the authors draw from the analysis.
* In section 5.3, for metric P2, the authors state that there is only one publisher providing metadata at the triple level with the suggested mechanism. This may also be a hint that the measured mechanism is not a best practice, and that other best practices exist which are not covered by the metric at hand.
* At the end of section 5.3, the authors complain about the lack of provenance information for many datasets, which may "make it hard for data consumers to re-use and adopt some dataset". Maybe this is a chicken-and-egg problem: how many true consumers of provenance information are there in the wild?
* Table 5 lists prefix.cc as a dataset. Is this really a dataset? I would rather consider it a service.
* For CS1, the authors mention that types are inferred. Some details are required here: which level of inference is used? Does this go beyond materializing the type hierarchy (e.g., using RDFS or even OWL Lite/DL reasoning)? The same holds for other metrics from the CS* set.
* Furthermore, for the same metric, it would be important to know whether linked vocabularies are also taken into consideration. For example, a set of types T1...Tn of a resource r might be consistent, but links to another vocabulary V might lead to a different result (e.g., Ti equivalentClass Vi, Tk equivalentClass Vk, Vi disjointWith Vk). In the latter case, the conclusions should also be more fine-grained (i.e., is this due to wrong vocabulary links or to inconsistent typing?).
* In the section describing CS2 on p.22, there is a lengthy explanation of OWL vocabulary elements, which is not necessary for the SWJ audience.
* For CS5, how are different resources defined? Do you consider only explicit owl:differentFrom statements, apply a local unique naming assumption, or a mixture of both?
* Some metrics mention the use of samples. I would like to suggest that this is omitted from the definition of the metric as such, as it mixes two aspects: the metric captures an actual quality dimension, while sampling is used to approximate the metric. Thus, the definitions should come without sampling, and the authors should mention separately for which metrics they used sampling.
* As far as sampling is concerned, the authors argue against reservoir sampling in section 5.5 and propose a hybrid approach on p.28, while on p.30 (footnote 33) and p.31 (PE2) they seem to use pure reservoir sampling again. This looks slightly inconsistent (a sketch of plain reservoir sampling is given after this list for reference).
* For SV3, I would appreciate some details on how valid datatypes (e.g., numbers, dates, etc.) are checked.
* On p.30, the authors mention 977,609 external PLDs. I assume that should be URIs, not PLDs.
* The second paragraph on p.31 ("If one considers...") was not clear to me.
* The paragraph after that speaks about interlinking tools, and seems somewhat out of place here.
* On p.33, the authors mention that not all metrics are taken into account for the average, e.g., for a dataset only accessible via a SPARQL endpoint. Hence, the final ranking actually compares apples and oranges.
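For reference on the sampling remarks above: by pure reservoir sampling I mean the classic Algorithm R, sketched below; it would help if the paper contrasted its hybrid approach explicitly with this baseline.

    import random

    def reservoir_sample(stream, k, seed=None):
        """Plain reservoir sampling (Algorithm R): a uniform random sample of
        k items from a stream of unknown length, using O(k) memory."""
        rng = random.Random(seed)
        reservoir = []
        for i, item in enumerate(stream):
            if i < k:
                reservoir.append(item)
            else:
                # Keep the new item with probability k / (i + 1).
                j = rng.randint(0, i)
                if j < k:
                    reservoir[j] = item
        return reservoir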
In summary, there are quite a few places in which this paper needs improvement. I recommend a major revision.
Language and layout related issues:
* In general, proof-reading by a native speaker would clearly help improve the article.
* The diagrams referred to are often far from the referring text. For example, on p.11, there is a reference to a diagram which is on p. 16
* In the text talking about the box plots, the authors say that those are left/right skewed, but the corresponding diagrams are top/bottom oriented
* p.3: "'O'pennness in the Linked Open Data Cloud" - I am not sure whether the 'O' is deliberate or not
* p.4: "It is the mechanism that defines whether third parties can re-use or otherwise, and to what extent." - This is an awkward sentence.
* p.20: worryingly -> worrying
* CS9: the term "datatype" is confusing here, as also types of resources are considered.
* p.33: close the parenthesis after "3.7 billion quads"
* p.34: "Bartletta's test" should probably read "Bartlett's test"
[1] https://profiles2016.wordpress.com/
[2] Heiko Paulheim and Sven Hertling: Discoverability of SPARQL Endpoints in Linked Open Data. ISWC Posters and Demonstrations, 2013.
[3] B. Szász, R. Fleiner and A. Micsik: Linked Data Enrichment with Self-Unfolding URIs. 2016 IEEE 14th International Symposium on Applied Machine Intelligence and Informatics (SAMI), Herlany, 2016, pp. 305-309.
[4] http://www.w3.org/TR/vocab-data-cube/
[5] http://dbpedia.org/ontology/CareerStation
[6] Robert Meusel and Heiko Paulheim: Heuristics for Fixing Common Errors in Deployed schema.org Microdata. In: ESWC 2015.