Evaluating the Quality of the LOD Cloud: An Empirical Investigation

Tracking #: 1683-2895

Authors: 
Jeremy Debattista
Christoph Lange
Sören Auer
Dominic Cortis

Responsible editor: 
Ruben Verborgh

Submission type: 
Survey Article
Abstract: 
The increasing adoption of the Linked Data principles brought with it an unprecedented dimension to the Web, transforming the traditional Web of Documents into a vibrant information ecosystem, also known as the Web of Data. This transformation, however, does not come without pain points. Similar to the Web of Documents, the Web of Data is heterogeneous in terms of the various domains it covers. The diversity of the Web of Data is also reflected in its quality. Data quality impacts the fitness for use of the data for the application at hand, and choosing the right dataset is often a challenge for data consumers. In this quantitative empirical survey, we analyse 130 datasets (~ 3.7 billion quads), extracted from the latest Linked Open Data Cloud, using 27 Linked Data quality metrics, and provide insights into the current quality conformance. Furthermore, we publish the quality metadata for each assessed dataset as Linked Data, using the Dataset Quality Vocabulary (daQ). This metadata can then be used by data consumers to search for and filter possible datasets based on different quality criteria. Thereafter, based on our empirical study, we present an aggregated view of Linked Data quality in general. Finally, using the results obtained from the quality assessment empirical study, we apply Principal Component Analysis (PCA) in order to identify the key quality indicators that can give us sufficient information about a dataset's quality. In other words, PCA helps us identify the non-informative metrics.
Tags: 
Reviewed

Decision/Status: 
Minor Revision

Solicited Reviews:
Review #1
By Heiko Paulheim submitted on 08/Aug/2017
Suggestion:
Major Revision
Review Comment:

The authors have taken considerable efforts in improving the paper, including the addition of a number of references. Many of the issues I raised in my original review have been addressed and discussed in the response letter, including the various points of critique I had against quite a few of the metrics used in the paper.

A part of the paper I am still not too happy with is the section where the PCA is conducted. The question in the box - i.e., which are the key features - is imho not answered. The PCA rules out three of the metrics which do not contribute much to the variance, but apart from that, the question is not answered. It should rather be understood as: which metrics can be safely removed, as they are already covered by others. Dimensionality reduction techniques like PCA are thus better suited to finding metrics that can be removed than to identifying the most promising ones.

Along the same lines, I do not agree with the authors' statement that closely related metrics are grouped by PCA. This is not what PCA does. Metrics ending up in the same component are not necessarily related or unrelated; they are simply combined linearly in a way that explains a lot of the variance in the data, regardless of whether they are semantically related or not.
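To make this point concrete, the following minimal sketch (my own illustration with a placeholder score matrix, not the authors' pipeline) shows what PCA actually computes on a 130 x 27 matrix of metric scores:

    # Rows are the 130 datasets, columns the 27 metric scores (placeholder values).
    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.preprocessing import StandardScaler

    scores = np.random.rand(130, 27)
    pca = PCA().fit(StandardScaler().fit_transform(scores))

    # Share of the total variance explained by each principal component;
    # components contributing almost nothing indicate redundancy, not "key" metrics.
    print(pca.explained_variance_ratio_)

    # Loadings of the first component: a weighted sum over ALL 27 metrics.
    # Two metrics with large weights in the same component are not thereby
    # semantically related; the weights merely maximise explained variance.
    print(pca.components_[0])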

I strongly appreciate that the discussion of the metrics has been extended, including possible caveats. In some cases, responses from the response letter should be included in the paper (e.g., the time-dependency of accessibility metrics).

Some points that have *not* been addressed properly in my opinion include:
* IO1: I still do not believe that this metric measures what the authors claim. Looking up a vocabulary in LOV only checks whether this vocabulary has been registered there, which, in turn, requires some minimal effort in describing metadata of the vocabulary. However, if I create a dataset with a proprietary vocabulary, and register that vocabulary at LOV, I do *not* reuse an existing vocabulary, but the metric suggests so.
* CS9: I do not think it is enough to materialize the subject's and object's type. For example, the DBpedia property dbo:starring has rdfs:range dbo:Actor . A triple where the object is of the (less specific) type dbo:Person would thus be marked as a range violation, although this is probably not desired (see the sketch after this list).
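To illustrate what I would consider sufficient here, the following rough sketch (my own illustration with tiny stand-in graphs and rdflib, not the authors' implementation) flags a range violation only after consulting the rdfs:subClassOf hierarchy:

    from rdflib import Graph, Namespace, RDF, RDFS, URIRef

    DBO = Namespace("http://dbpedia.org/ontology/")

    # Stand-ins for the DBpedia ontology and for a dataset under assessment.
    ontology = Graph()
    ontology.add((DBO.starring, RDFS.range, DBO.Actor))
    ontology.add((DBO.Actor, RDFS.subClassOf, DBO.Person))

    data = Graph()
    film = URIRef("http://example.org/film1")
    person = URIRef("http://example.org/person1")
    data.add((film, DBO.starring, person))
    data.add((person, RDF.type, DBO.Person))   # less specific than dbo:Actor

    def compatible_with_range(obj, prop):
        """Accept obj if one of its asserted types equals the declared range,
        is a subclass of it, or is a superclass of it."""
        ranges = set(ontology.objects(prop, RDFS.range))
        if not ranges:
            return True   # nothing declared, nothing to violate
        types = set(data.objects(obj, RDF.type))
        for rng in ranges:
            for typ in types:
                if (rng in ontology.transitive_objects(typ, RDFS.subClassOf)
                        or typ in ontology.transitive_objects(rng, RDFS.subClassOf)):
                    return True
        return False

    # dbo:Person-typed object for dbo:starring (range dbo:Actor): not a violation.
    print(compatible_with_range(person, DBO.starring))   # True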

A remark from my original review which should still be addressed:
* CS6: In my opinion, it would make more sense to use the overall size of the vocabulary as the denominator.

I am confident that the authors will be able to address these issues. However, since the analysis of informative metrics in particular needs careful rework, I would still recommend a major revision.

Review #2
By Amrapali Zaveri submitted on 11/Sep/2017
Suggestion:
Minor Revision
Review Comment:

The authors have satisfactorily addressed my comments.

The link https://w3id.org/lodqautor does not work; measures should be taken to ensure that the link remains active.

The one issue I have is with the sentence:
“lack of evidence, for example no algorithm presentation or brief description of the metric” in the paper, and also with what the authors mention in the response letter:
“Therefore, the reader of Zaveri et al. would have to look into more detail of the metric in the reference literature. Having said that, the other literature was not always formalising these metrics, sometime it was a mere explanation of the metric and left the details to the reader’s imagination. “
While I agree that the survey paper only contains brief descriptions of the metrics, I do not agree that there are no “algorithms” per se, because the references provided for each of the metrics describe a methodology/technique with which that particular dimension is assessed. It is not necessarily always in algorithmic form, but they do provide an approach to compute that particular dimension.

Just a comment: It is surprising that, according to PCA, the metric “Links to External Data Providers” is non-informative when we are talking about Linked Data!

Also, I list some of the typos/suggestions:
Reword Figure 5 and fix Figure 8 caption
“there is a number of” - “there are a number of”
“where N-Triples” - “were N-Triples”
“[55] classify” - “[55] classifies”
“(U3) Presence of URI Regular Expression” - reword to maybe “Presence of Regular Expression defining the URI” or something similar
“bar a few number” - “barring a few number”
“authoritative” - “authorized”
“For each triple being assessed, ...” - incomplete sentence
“analyses the text searching” - “analyses the text for searching”
“5” - “five”
“we will use the PCA” - “we will use PCA”
“statistical tests gives” - “statistical tests give”
“helps us to find” - “helps us find”

Review #3
Anonymous submitted on 03/Oct/2017
Suggestion:
Minor Revision
Review Comment:

I appreciate the time the authors have spent to revise their paper and I am satisfied with this first reviewing round. However, I have a few more comments, somewhat related to one of my previous comments about the "novelty of quality metrics". I agree with the authors' answer, but after reading the paper and checking all the formulas I think there is room for further improvement. In the following, I provide more detailed comments from this additional review of the paper:

section 2
*there are a few more works to be included. The authors should highlight the similarities and differences with respect to their own work; this is fundamental in such a study:
-Linked Data Quality of DBpedia, Freebase, OpenCyc, Wikidata, and YAGO
-A comprehensive quality model for Linked Data
-Literally Better: Analyzing and Improving the Quality of Literals

section 5
My main concern about the quality metrics is that they need to be better formalised. Reading this section, a new concept and a new symbol are introduced at every turn. Since this section is long and provides the main contribution of this work, it should be very easy to read and to understand. My suggestion is therefore to extract most of the terminology into a separate section where you introduce the definitions of, for instance, a triple, an entity, etc., and to summarise them in a table together with the symbols used in section 5; a sketch of what I have in mind follows the metric-by-metric comments below. Now, I will go into detail and analyse each metric:

*RC1: it is not clear from the definition what you mean by a data-level constant. It seems to me that the definition assumes the subject and the object are of type Uri, while in the definition of a triple the subject can be of two types, Uri or BN, and the object can be of three types: Uri, BN or Literal.
*IO1 and IN3: you are using symbols with an overbar, and I do not understand why you sometimes use such symbols and sometimes not. Do you consider them as vectors?
*V1: is VoID the only vocabulary to express the format? What about DCAT (an alternative to VoID)? What is the range of the property void:feature?
*V2: so far you have referred to triples, and now there is a symbol t and then t.o. What is the meaning of t.o? You should also be consistent about which formalisation language you are using; if it is first-order logic, then keep it from the beginning to the end.
*P1: why should each resource have a dc:creator or a dc:publisher? I think this applies more at the dataset level, and only the resource referring to the dataset should carry this information. Not clear at all.
*P2 and U1: what is the difference between the set of entities and the set of distinct subject URIs? You should clarify this in the background section. You should also clarify P2 further: what is the weighted value of an entity? You seem to be saying that each entity is given a weight of 0.5.
*U1: what does desc.(s,p,o) mean? Is the dot a typo, or does it have a meaning in this formalisation?
*CS1: types(r) should be defined in the background/preliminaries section. Triples are sometimes expressed as SxPxO and sometimes as SPO.
*CS3: now the set is written as {t \in D ..., whereas so far you have just used {t ... What is the difference?
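To clarify what I mean by extracting the terminology (see my comment before the metric-by-metric list above), a sketch of the kind of preliminaries I have in mind follows; the exact definitions are of course for the authors to fix:

    % Illustrative only; the concrete definitions must come from the authors.
    Let $\mathcal{U}$, $\mathcal{B}$ and $\mathcal{L}$ denote the sets of URIs,
    blank nodes and literals. A triple is
    $t = (s, p, o) \in (\mathcal{U} \cup \mathcal{B}) \times \mathcal{U}
      \times (\mathcal{U} \cup \mathcal{B} \cup \mathcal{L})$,
    and we write $t.s$, $t.p$ and $t.o$ for its components.
    A dataset $D$ is a finite set of triples; the set of distinct subjects is
    $S(D) = \{\, s \mid (s,p,o) \in D \,\}$, and the set of entities $E(D)$
    should be defined explicitly (e.g.\ as $S(D)$, or as the subjects carrying
    an asserted \texttt{rdf:type}), together with $\mathit{types}(r)$.
    All metric definitions in section 5 should then use only these symbols.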

section 6
*explain in Table 7 what "approx chi-square" is and how the value 999.81 should be interpreted
*explain also df and Sig.
*My understanding is that you use the quality assessment output of each dataset as input to the KMO and Bartlett's Test of Sphericity. How is it that, only by using the quality assessment output values, you could reject the H0 hypothesis? Can you be more explicit about this?
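To be concrete about what I am asking for, my assumption is that the values in Table 7 are obtained roughly as in the following sketch (placeholder score matrix; Bartlett's test only), and the paper should spell out this reading:

    # One row per assessed dataset, one column per quality metric (placeholders).
    import numpy as np
    from scipy.stats import chi2

    scores = np.random.rand(130, 27)
    n, p = scores.shape
    R = np.corrcoef(scores, rowvar=False)        # correlation matrix of the metrics

    # H0 of Bartlett's test of sphericity: R is the identity matrix,
    # i.e. the metric scores are uncorrelated across datasets.
    approx_chi_square = -(n - 1 - (2 * p + 5) / 6) * np.log(np.linalg.det(R))
    df = p * (p - 1) // 2                        # number of distinct correlations tested
    sig = chi2.sf(approx_chi_square, df)         # the "Sig." column: p-value of the test

    print(approx_chi_square, df, sig)            # Sig. < 0.05 means H0 is rejected

A large approximate chi-square such as 999.81 with Sig. below 0.05 simply says that the observed correlations between the metric scores are very unlikely under H0, which is what licenses running PCA at all; this is the kind of explanation I would like to see in the text.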

Minor comments:
*re-write the sentence "PCA helps in finding the best possible characteristics to summarise the given data as well as possible." -> in particular, the part "as well as possible" is very vague
*once an acronym has been introduced, use only the acronym, e.g. avoid "check whether Principal Component Analysis (PCA)"