Evaluating the Quality of the LOD Cloud: An Empirical Investigation

Tracking #: 1491-2703

Jeremy Debattista
Christoph Lange
Sören Auer

Responsible editor: 
Ruben Verborgh

Submission type: 
Survey Article
Abstract: 
The increasing adoption of the Linked Data principles brought with it an unprecedented dimension to the web, transforming the traditional Web of Documents to a vibrant information ecosystem, also known as the Web of Data. This transformation, however, does not come without any pain points. Similar to the Web of Documents, the Web of Data is heterogeneous in terms of the various domains it reflects. This diversity is also reflected in the quality of the Web of Data. Data quality impacts the fitness for use of the data for the application at hand, and choosing the right dataset is often a challenge for data consumers. In this quantitative empirical survey, we analyse 130 datasets (~5 billion quads), extracted from the latest Linked Open Data Cloud, using 27 Linked Data quality metrics, and provide insights into the current quality conformance. Furthermore, we published the quality metadata for each assessed dataset as Linked Data, using the Dataset Quality Vocabulary. This metadata could then be used by data consumers to search and filter possible datasets based on different quality criteria. Thereafter, based on our empirical study, we present an aggregated view of the Linked Data quality in general. Finally, using the results obtained from the quality assessment empirical study, we use the Principal Component Analysis (PCA) test in order to identify the key quality indicators that can give us sufficient information about a dataset's quality. In other words, the PCA will help us to identify the non-informative metrics.
Major Revision

Solicited Reviews:
Review #1
By Heiko Paulheim submitted on 05/Jan/2017
Major Revision
Review Comment:

The article describes an in-depth empirical study of various quality indicators for Linked Open Data, using a larger-scale collection of datasets. Furthermore, the authors suggest the use of PCA to judge the utility and redundancy of the individual quality metrics.

In general, the topic is interesting and useful, but the article has a few flaws that should be addressed before publication.

First of all, the mathematical notations often seem to follow rather unusual notation schemes, at least to me. Although it is possible in most cases to grasp the idea of the formulas, in particular when studying the accompanying text, the formulas should be exact in themselves, not an approximation of what is meant.

The use of PCA is interesting, but the corresponding section is only partly informative. Moreover, my feeling is that a simple correlation analysis of the metrics may yield the same information, i.e., an understanding of which metrics are redundant. I also miss some conclusions here, e.g., stating that for two highly correlated metrics, it may be enough to compute one (ideally the one with the least computational effort).
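Such a correlation analysis is straightforward to sketch; in the following, the metric names and per-dataset scores are purely illustrative and not taken from the paper's results:

```python
import numpy as np

# Hypothetical per-dataset scores for four quality metrics
# (rows = datasets, columns = metrics); values are illustrative only.
metrics = ["RC1", "RC2", "A3", "CS1"]
scores = np.array([
    [0.90, 0.88, 0.10, 0.50],
    [0.80, 0.79, 0.90, 0.60],
    [0.40, 0.42, 0.30, 0.55],
    [0.20, 0.18, 0.60, 0.52],
    [0.60, 0.61, 0.40, 0.58],
])

# Pearson correlation between metric columns.
corr = np.corrcoef(scores, rowvar=False)

# Metric pairs with |correlation| above a threshold are candidates for
# redundancy: it may suffice to compute only the cheaper one of each pair.
threshold = 0.95
redundant = [
    (metrics[i], metrics[j])
    for i in range(len(metrics))
    for j in range(i + 1, len(metrics))
    if abs(corr[i, j]) > threshold
]
print(redundant)  # [('RC1', 'RC2')]
```

Unlike PCA, this directly names the redundant pairs, which is the actionable conclusion I miss in the section.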

With regard to related work, there are a few more works in the area that could be mentioned. Both Zaveri et al. [52] and Schmachtenberg et al. [46] are used later, but not listed as related work. Some more related work can be found, e.g., by looking at the PROFILES workshop series [1]. Furthermore, it might be worthwhile looking at a similar analysis we conducted for schema.org [6], as we also used some similar metrics, such as CS3.

Some of the metrics discussed are arguable. Examples include:
* RC1: The authors suggest that there is a consensus that shorter URIs are better. However, there are also counter-arguments, e.g., for self-unfolding URIs [3].
* RC2: The authors themselves state an example where ordering is essential (i.e., authors of a publication). Hence, RC2 would "punish" datasets in domains where such cases exist.
* IO1: This metric is about proprietary vocabularies, and for computation, the fact whether a vocabulary is registered in LOV is used as a proxy for a vocabulary being non-proprietary. From my understanding, it would be a better proxy to analyze whether the vocabulary is used by other datasets as well.
* U1 computes the set of subject URIs (why only subjects, not objects as well?) which have a label. This may not be a good metric in some cases. For example, in a Linked Data dataset containing a data cube [4], it would not make sense to equip each single observation with a label. A similar argument holds for intermediate concepts introduced to express n-ary relations, e.g., CareerStations [5] in DBpedia.
* CS6: In my opinion, it would make more sense to use the overall size of the vocabulary as the denominator.
* CS9: If inference is used here, then each resource is of type owl:Thing. In that case, the metric would always yield 1. If inference is not used, on the other hand, the metric would produce some false negatives.
* The accessibility metrics seem to be measured at a single point in time only. However, some endpoints are sometimes offline for a certain time, planned or unplanned, and come back online again. The experiment should be designed in a way that measures accessibility repeatedly.
* A3 mixes the dereferenceability of local and non-local URIs. This is a bit problematic, since the dereferenceability of a local URI lies in the responsibility of the dataset provider, while that of a non-local URI does not. In an extreme case, this metric may be optimized by omitting all dataset interlinks, to avoid being punished for a linked dataset being non-accessible. I suggest handling local and non-local URIs differently.
* A3 reports blacklisting as a possible reason for failed connections. This might be a methodological problem, as some endpoints may have an upper limit on requests for a given amount of time. The same holds for PE2 and PE3.
* I1 measures the fraction of resources that are linked to a dereferenceable URI. First of all, this mixes linkage and dereferenceability (where the latter is covered by A3, see my remarks on that metric above). Second, this may punish very specific datasets, which contain many concepts that simply do not have a counterpart in any other dataset (consider, again, a massive data cube with many observations. What should those observations be linked to?)
* For PE3, it seems odd that a server needs to respond to only one query in under 1s. It should be all responses, or at least the majority, or 95%. Furthermore, the metric counts *all* responses. This means that a server which blacklists clients after 5 requests issued within a minute could take 20s to answer a request, but deliver the blacklist response really fast, and still pass the metric with a full score.
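A percentile-based formulation of the response-time check could look like the following sketch; the function name, threshold and probe values are hypothetical, not the paper's implementation:

```python
def passes_latency_threshold(latencies_s, quantile=0.95, threshold_s=1.0):
    """Pass only if the given quantile of observed response times stays
    below the threshold, instead of the single fastest response.
    `latencies_s` holds response times in seconds from repeated probes."""
    if not latencies_s:
        return False
    ordered = sorted(latencies_s)
    # Index of the order statistic at (or just above) the requested quantile.
    k = min(len(ordered) - 1, int(quantile * len(ordered)))
    return ordered[k] < threshold_s

# A single fast (possibly blacklist) response no longer masks many slow ones:
print(passes_latency_threshold([0.05, 20.0, 21.0, 19.5, 22.0]))  # False
print(passes_latency_threshold([0.2, 0.3, 0.4, 0.5, 0.6]))       # True
```

With such a definition, the fast blacklist response in the scenario above would no longer be enough to earn a full score.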

Some more detailed remarks:
* On p.2, it is stated that "LOD datasets [...] are more prone to link spamming than on the Web". I doubt this (what would be the incentive for link spamming on LOD datasets?), and some evidence should be provided for that statement.
* On p.4, in the list of accepted formats, I miss JSON-LD as a possible format
* p.4.: "Another initial experiment was performed on the LOD Cloud snapshot to check how many datasets provide some kind of machine readable license..." - some details would be appreciated. Furthermore, it would be interesting to compare the results to Schmachtenberg et al., who conducted a similar analysis
* p.6: It is stated that CC Attribution-NonCommercial-ShareAlike and Project Gutenberg are non-conformant licenses for LOD. An explanation would be appreciated.
* p.6: The authors state that according to SPARQLES, the number of reliable SPARQL endpoints has decreased from 242 to 181; hence, they state that 12% of the endpoints became less reliable. First, the resulting number according to my own calculation is 11%, not 12%. Second, from what I understand, this is not a valid conclusion. The actual number may even be higher, since the change may also be an effect of SPARQL endpoints no longer being maintained at all, and new SPARQL endpoints being added to the list.
* In section 4.1, an automatic identification of SPARQL endpoints is proposed. A similar analysis is proposed in [2], which should be referenced and compared to.
* Paragraph 4.2 is a largely redundant summary of the Venn diagram depicted in Fig.4. It should be replaced by a more thorough discussion of the conclusions the authors draw from the analysis.
* In section 5.3 for metric P2, the authors state that there is only one publisher providing metadata at triple level with the mechanism suggested. This may also be a hint that the mechanism measured is not a best practice, and that other best practices exist which are not covered by the metric at hand.
* At the end of section 5.3, the authors complain about the lack of provenance information for many datasets, which may "make it hard for data consumers to re-use and adopt some dataset". Maybe this is a chicken-and-egg problem: how many true consumers of provenance information are there in the wild?
* Table 5 lists prefix.cc as a dataset. Is this really a dataset? I would rather consider it a service.
* For CS1, the authors mention that types are inferred. Some details are required here: which level of inference is used? Does this go beyond materializing the type hierarchy (e.g., using RDFS or even OWL Lite/DL reasoning)? The same holds for other metrics from the CS* set.
* Furthermore, for the same metric, it would be important to know whether linked vocabularies are also taken into consideration. For example, a set of types T1...Tn of a resource r might be consistent, but links to another vocabulary V might lead to a different result (e.g., Ti equivalentClass Vi, Tk equivalentClass Vk, Vi disjointWith Vk). In the latter case, the conclusions should also be more fine-grained (i.e., is this due to wrong vocabulary links or inconsistent typing?)
* In the section describing CS2 on p.22, there is a lengthy explanation of OWL vocabulary elements, which is not necessary for the SWJ audience.
* For CS5, how are different resources defined? Do you consider only explicit owl:differentFrom statements, apply a local unique naming assumption, or a mixture of both?
* Some metrics mention the use of samples. I would like to suggest that this is omitted from the definition of the metric as such, as it mixes two aspects: the metric captures an actual quality dimension, while sampling is used to approximate the metric. Thus, the definitions should come without the sampling, and the authors should mention for which metrics they used sampling.
* As far as sampling is concerned, the authors argue against reservoir sampling in section 5.5 and propose a hybrid approach on p.28, and on p.30 (footnote 33) and p.31 (PE2), they seem to use pure reservoir sampling again. This looks slightly inconsistent.
* For SV3, I would appreciate some details on how valid datatypes (e.g., numbers, dates, etc.) are checked.
* On p.30, the authors mention 977,609 external PLDs. I assume that should be URIs, not PLDs.
* The second paragraph on p.31 ("If one considers...") was not clear to me.
* The paragraph after that speaks about interlinking tools, and seems somewhat out of place here.
* On p.33, the authors mention that not all metrics are taken into account for the average, e.g., for a dataset only accessible via a SPARQL endpoint. Hence, the final ranking actually compares apples and oranges.
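Regarding the sampling remarks above: the plain reservoir sampling the authors seem to fall back to (Algorithm R) can be sketched as follows; this is the generic textbook procedure, not the paper's hybrid variant:

```python
import random

def reservoir_sample(stream, k, seed=None):
    """Algorithm R: draw a uniform random sample of k items from a stream
    of unknown length in a single pass, using O(k) memory."""
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)
        else:
            # Keep the new item with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir

sample = reservoir_sample(range(1_000_000), 25, seed=42)
print(len(sample))  # 25
```

Whichever variant is used, it should be stated consistently in one place rather than differing between section 5.5, footnote 33 and PE2.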

In summary, there are quite a few places in which this paper needs improvement. I recommend a major revision.

Language and layout related issues:
* In general, proof-reading by a native speaker would clearly help improve the article
* The diagrams referred to are often far from the referring text. For example, on p.11, there is a reference to a diagram which is on p. 16
* In the text talking about the box plots, the authors say that those are left/right skewed, but the corresponding diagrams are top/bottom oriented
* p.3: "'O'pennness in the Linked Open Data Cloud" - I am not sure whether the 'O' is deliberate or not
* p.4: "It is the mechanism that defines whether third parties can re-use or otherwise, and to what extent." - This is an awkward sentence.
* p.20: worryingly -> worrying
* CS9: the term "datatype" is confusing here, as also types of resources are considered.
* p.33 close parentheses after "3.7 billion quads"
* p.34: Bartletta's test

[1] https://profiles2016.wordpress.com/
[2] Heiko Paulheim and Sven Hertling: Discoverability of SPARQL Endpoints in Linked Open Data. ISWC Posters and Demonstrations, 2013.
[3] B. Szász, R. Fleiner and A. Micsik, "Linked data enrichment with self-unfolding URIs," 2016 IEEE 14th International Symposium on Applied Machine Intelligence and Informatics (SAMI), Herlany, 2016, pp. 305-309.
[4] http://www.w3.org/TR/vocab-data-cube/
[5] http://dbpedia.org/ontology/CareerStation
[6] Robert Meusel and Heiko Paulheim: Heuristics for fixing common errors in deployed schema.org microdata. In: ESWC 2015.

Review #2
Anonymous submitted on 06/Jan/2017
Major Revision
Review Comment:

This paper provides an empirical analysis on the topic of Linked Data quality. The authors initially discuss the accessibility of the datasets from the LOD cloud, and then provide 27 metrics identified in the survey paper of Zaveri et al. 2016. Each metric is formalized and analyzed in detail for the 130 datasets that are accessible in the LOD cloud. The goal is to provide a quality ranking of the datasets, calculated based on the aggregation of the 27 quality metrics considered in the analysis.

The paper is clearly on topic for this journal and tackles a very important issue for the Semantic Web/Linked Data community. Although the measurement of metrics on which it is based is quite broad and detailed, the analysis of the informativeness of the metrics (section 6) is very narrow.

In general, I like how the authors set out to provide the evaluation of quality. However, in terms of execution of the paper, I found the discussion of the metrics very straightforward, and I didn't find any novelty, since these metrics were already discussed in an existing work (the survey of Zaveri et al. 2016) and were formally provided in Hogan et al. 2012. While the previous two works provide a conceptualization and formalization of dimensions and metrics, here only statistical values may be provided and discussed. Further, the paper is quite detailed - as would be expected in a survey. Section 5 especially makes for tedious reading. I think this section needs to be tightened up a lot more and discussed better. Taking one small example, RC2 is a metric provided in Zaveri et al. 2016 and Hogan et al. 2012. I really fail to see why so many details are given for such a metric. Throughout this section, I encountered this issue again and again.
Returning to the CfP text, I could imagine a practitioner or similar reader, after finishing this paper, asking himself: and then what? At least I know that I would struggle to answer that, having read the paper. Furthermore, I think the practitioner would struggle to say which dataset to choose based on the overall quality metrics provided.

Another general comment is that the metrics provided by the authors require in most cases a vocabulary where the properties and classes are defined. This is not always the case: the LOD cloud contains a lot of datasets that do not have an ontology. In such cases, a reference against which incorrect or incomplete values could be determined is difficult to establish. Thus, I think that this work is limited to the (quite straightforward) cases where each dataset has an ontology.

High-level suggested improvements:
* Revise all the metrics. Rather than just giving a formula, it is more important to formalize the conditions which determine the input to the formula. Usually a formula is a simple division between two values, and to be normalized it should be subtracted from 1. Consolidate the metrics and their calculations
* Improve the writing of the paper throughout (including, but not limited to, minor comments). The writing is also a contribution!
* Mostly I suggest tightening up the text. The paper should be shorter, not longer.


Sec 2
You mention different works in the state of the art, but for most of them you do not provide a clear distinction. What is the advancement of your work with respect to the state of the art? For example, in the description of the work of Hogan et al., you do not state the differences between your work and theirs, which is very important for this kind of paper.

569 - correct to 570 datasets
239 - correct to 374 datasets in datahub.io

Sec 4
Fig. 3 -> I think there is an imprecision in the figure. It is possible to consider the SPARQL endpoint and the data dump only if the dataset does not provide a VoID description. In practice, if you ask "all distributions Processed?" and answer "NO", you will always go to "Has valid VoID Description?", and if the answer is "YES", you will never be able to reach the other two questions.
I think that a dataset may have a VoID description and at the same time a SPARQL endpoint, but this is not possible in the schema you are showing.

Section 4.2 seems redundant. I checked the data against those already provided in section 3.1 and found them inconsistent. How do you explain that? How is this different from section 3.1? I also find that Fig. 4 doesn't contribute anything of significance.

Sec 5
This section is quite long without providing much of note. It should certainly be shortened (or greatly improved/clarified).

RC1 is not precise. Instead of AND, you should use OR between the two conditions. Is the result of the metric a normalized value between 0 and 1? Instead of size(), I would suggest |u|, which already denotes the cardinality of the set of URIs.
I don't understand why you propose a new metric when the metric was already implemented by Hogan et al.
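To illustrate the suggested reformulation, an RC1-style ratio using set cardinality and OR-combined conditions might be sketched as follows; the 80-character threshold appears in the paper, while the query-parameter condition and the function name are assumptions for this sketch, not the paper's exact definition:

```python
def short_uri_ratio(uris, max_len=80):
    """Fraction of URIs counted as 'short': at most max_len characters
    long OR free of query parameters (conditions combined with OR, as
    suggested). The result is |short| / |U|, i.e. normalized to [0, 1]."""
    if not uris:
        return 1.0
    short = [u for u in uris if len(u) <= max_len or "?" not in u]
    return len(short) / len(uris)

print(short_uri_ratio([
    "http://example.org/resource/a",
    "http://example.org/resource/b?" + "param=1&" * 20,  # long, with query
]))  # 0.5
```

Stating the conditions explicitly like this, and then giving the ratio, would make the metric definition precise without any need for a size() operator.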

Sometimes it seems you are giving suggestions, e.g., p.13, second paragraph: "In order to improve schema re-use, services such as LOV and Swoogle should be used to find suitable schemas. On the other side, vocabulary curators should maintain and promote their schemas, for example, by making sure that vocabularies are properly dereferenceable."
You should either do this for each section or remove it, in order to make all sections uniform.

Table 4 -> how do you explain the difference in v(C,1.0) for the first two domains when all the values for RC1, RC2, IO1, IO2, IN3, IN4, V1 and V2 are the same for both domains?

P1: is it sufficient to say that there should exist at most one triple with dc:creator or dc:publisher as its predicate? Provenance information is not limited to only these two predicates. What happens if a dataset has more provenance information? Is it better or the same as a dataset having only one piece of provenance information? Your basic(d) always returns 1.

U1: is the list of predicates exhaustive enough to be taken into consideration for each new dataset?

CN2 -> not clearly explained. How do you capture that two entities represented by two different URIs may be the same thing? As said before, please be more precise and formalize what you call a redundant entity.

CS1 -> most of the datasets published in the LOD cloud do not have a vocabulary, and when they do, not every owl:disjointWith relation is stated between two classes.

CS2 -> it seems that you already know which properties of a dataset are defined in a vocabulary. In my opinion this is the simplest case and makes the measurement straightforward. What about those datasets that have undefined properties or classes?

Table 6 -> CS1, CS2, CS4, CS6 are always 100%. Isn't that weird? Either the metric is not relevant for those datasets or something has gone wrong.
How come the last two domains have these ranking values? Since the weight is always 1, shouldn't the aggregation of the last metrics be higher than that of the previous domain?

Table 8 -> L2 is always 0%, except for http://ecb.270a.info/, where it is 100%. Isn't that weird? How come only for this domain can all possible human-readable licenses be found and for the others not? It seems that it has all human-readable licenses but not any machine-readable license.

Sec 6
What about CS1, which has a value (-0.75) lower than the cut-off of 0.5?
More discussion of the results in this section would illustrate the informativeness of the metrics better.

Review #3
By Amrapali Zaveri submitted on 29/Jan/2017
Major Revision
Review Comment:

The article "Evaluating the Quality of the LOD Cloud: An Empirical Investigation" is a large-scale quantitative empirical survey of 130 datasets extracted from the Linked Open Data Cloud using 27 Linked Data quality metrics. The analysis provides insight into the current quality conformance of the existing LD datasets. The results are published as a quality metadata graph for each assessed dataset; the results themselves are published as Linked Data resources, represented using the Dataset Quality Vocabulary. Additionally, with the help of the tool, users can search, filter and rank datasets according to a number of quality criteria, and more easily discover the relevant datasets according to their use cases. Furthermore, a statistical analysis of the assessment results is performed in order to determine which of the chosen quality metrics are key quality indicators.

Overall, the study is an important one on this topic as the previous similar large-scale study was done by Hogan et al. in 2012. The 27 metrics have been selected from the survey by Zaveri et al. and implemented to measure each one of them against the datasets. The authors provide results for each metric and an assessment of the state of the Linked Datasets according to each metric.

(1) Suitability as introductory text, targeted at researchers, PhD students, or practitioners, to get started on the covered topic.
To get an idea of the current state of the Linked Data Cloud datasets (at least until 2014) in terms of data quality, this paper can serve as a suitable introductory text as well as guidelines of best practices for Linked Data publishing that should be followed but are not currently. Moreover, the results are reproducible with the help of the tool, Luzzu.

However, there are a number of questions that arise that need clarification/explanation for the rationale:
- What is the reasoning behind choosing only 27 metrics out of the 69 reported in the survey?
- Why is the W3C's Data Quality Vocabulary not used in expressing the results metadata?
- There should be a rationale provided for Section 3. Why is openness discussed here? This section does not quite fit the flow of the paper. Moreover, there is a lot of stress on metadata in this section, which is also mentioned in Section 2; the two could essentially be merged. Also, I don't see the point of referring to [6] in this section, and [5] is repeated here again. Better still, openness could be proposed as yet another quality dimension/metric, or perhaps be merged with the analysis performed for metric P1. Similarly, Sections 3.1 and 3.2 could be merged with the Accessibility dimension's metrics.
- The total no. of datasets and total no. of triples should also be mentioned in the Introduction. In fact there is a discrepancy in the numbers: in the Abstract it is 5 billion quads, whereas in Section 5.6 it is 3.7 billion quads.
- Instead of reporting only the top and bottom five datasets, also reporting notable results for each metric would be more interesting. In fact, (optionally) it would be great to even have figures similar to 1 & 2 for each of the metrics to get a visual overview of the state of the datasets.
- The structure of all the metrics discussions should be consistent. First, I would like to see a list of the metrics upfront after each category is introduced, and to have each metric as a numbered sub-section, as this would make it easy to navigate through the sections. Secondly, I recommend that the authors add peculiar examples encountered when analyzing the results for each of the metrics. For example, "37% of the defined vocabularies were not used in their respective datasets...".
- For metric RC1, how is your metric more flexible when you have specified that the length of the URIs should be 80 characters or less?
- How is metric P2 different from P1?
- The actual number of dereferenceable PLDs is 3086; is this out of 977,609 PLDs? How do you calculate the total number of possible external PLDs?
- Do you also plan to incorporate other subjective metrics from the survey in the future?
- Also, what are the plans for the future in terms of assessment of these metrics at regular intervals?

(2) How comprehensive and how balanced is the presentation and coverage.
- Since the work derives from [52], it should be mentioned here too, along with the rationale for choosing only particular metrics out of the entire 69 in the survey.
- Also provide the similarity of your work with [26].
- How is the assessment of vocabularies [48] relevant (or related to) your work? In that, the paper about "Five Stars of Linked Data Vocabulary Use" should also be cited: http://www.semantic-web-journal.net/content/five-stars-linked-data-vocab....
- Another interesting work that should be cited is http://content.iospress.com/articles/semantic-web/sw216.

(3) Readability and clarity of the presentation.
The paper is relatively easy to follow, however there are a number of formal errors which are listed here:
- "web" - "Web"
- "we published" - "we publish"
- "help us to identify" - "help us identify"
- Data Quality Vocabulary - add W3C's Data Quality Vocabulary (?)

Section 1
- "summarised into the" - "summarised as"
- "such page rank" - "such as Page Rank"
- "usually a complex" - "usually complex"
- The research question should probably be rephrased to "What is the quality of existing Data on the Linked Data Web?"
- Be consistent in referring to either "linked Data" or "Linked Open Data" and also "linked data" and "Linked Data"
- Minimize use of words such as "could", "may" etc.
- "twenty-seven" - 27

Section 2
- "analyses" - "analyse"
- "conduct" - "conducted"
- Format footnote 2 correctly.

Section 3
- Add reference for "Heath and Bizer"

Section 3.1
- Footnote 6 and 7 - full stop missing at the end.

Section 3.2
- "datahub" - "DataHub"
- "Table 3.2" does not exist.

Section 3.3
- "should be part of" - "should not be part of"
- "stood 242" - "stood at 242"

Section 5
- "work that survey" - "work that the survey"

Section 5.1
- "DBTropes" - add reference

Section 5.3
- vis-Ãa ̆-vis - fix
- "in the assessed dataset" - full stop missing
- "population population" - "population"
- "worringly" - reword

Section 5.4
- "Zaveri et al. [52] classifies" - "Zaveri et al. [52] classify"
- "This metric analyse" - "This metric analyses"
- "have a domain" - "have domain"
- "which sample can be" - "which can be"
- "Turtle 1.1" - add reference/link
- "use adapt" - "adapt"

Section 5.5
- The Licensing section seems to have abruptly been started.
- "a sampler of maximum 25 items.Estimation" - "a sample of maximum 25 items. Estimation"
- Footnote 33 - bracket missing
- There isn't a requirement to describe Silk and LIMES as is done currently.
- "10 local resources" - "10 local resources from each dataset"

Section 5.6
- "exception" - "exceptions"

Section 6
- "BartlettâĂŹs" - fix
- In Table 10, I would also put in the short-forms for the metrics.

- Please refer to the correct version of [52]: http://iospress.metapress.com/content/k4434840668h5841/
- Capitalize words such as "ccrel", "rdf", "void", "lod", "limes", "uris", "owl"
- Add link to reference [2].
- Reference [41] is incomplete.

(4) Importance of the covered material to the broader Semantic Web community.
Data quality is an important topic in the Semantic Web community but is not given enough importance. Thus, having a tool and performing quality assessments on LOD is quite the need-of-the-hour. Providing the data quality results metadata as dereferenceable Linked Data resources, which the datasets publishers can directly plug into their dataset description, is a good practice and should be encouraged. Moreover, this paper gives a good overview of the current state of the datasets in LOD in terms of different metrics. Specifically it points out the current problems in data publishing that hinder the uptake of the Linked Data Web.