Editorial Board

Editor-in-Chief
Krzysztof Janowicz

Managing Editors
Cogan Shimizu
Eva Blomqvist

Editorial Board
Mehwish Alam
Claudia d’Amato
Stefano Borgo
Boyan Brodaric
Philipp Cimiano
Michael Cochez
Oscar Corcho
Bernardo Cuenca-Grau
Elena Demidova
Jerome Euzenat
Mark Gahegan
Aldo Gangemi
Anna Lisa Gentile
Rafael Goncalves
Dagmar Gromann
Armin Haller
Pascal Hitzler
Aidan Hogan
Katja Hose
Eero Hyvönen
Sabrina Kirrane
Agnieszka Lawrynowicz
Freddy Lecue
Maria Maleshkova
Raghava Mutharaju
Axel Polleres
Guilin Qi
Marta Sabou
Harald Sack
Christoph Schlieder
Stefan Schlobach
Cogan Shimizu
Blerina Spahiu
GQ Zhang
Rui Zhu

Former/Founding Editors-in-Chief
Pascal Hitzler

Editorial Assistants
Michael McCain

Syndicate

Quality Assessment Methodologies for Linked Open Data

Submitted by Amrapali Zaveri on 12/14/2012 - 09:01

Tracking #: 414-1533

A new version of this paper is available

Authors:

Amrapali Zaveri

Anisa Rula

Andrea Maurino

Ricardo Pietrobon

Jens Lehmann

Sören Auer

Responsible editor:

Pascal Hitzler

Submission type:

Survey Article

Abstract:

The development and standardization of semantic web technologies have resulted in an unprecedented volume of data being published on the Web as Linked Open Data (LOD). However, we observe widely varying data quality ranging from extensively curated datasets to crowd-sourced and extracted data of relatively low quality. Data quality is commonly conceived as fitness of use. Consequently, a key challenge is to determine the data quality wrt. a particular use case. In this article, we present the results of a systematic review of approaches for assessing the data quality of LOD. We gather existing approaches and compare and group them under a common classification scheme. In particular, we unify and formalise commonly used terminologies across papers related to data quality. Additionally, a comprehensive list of the dimensions and metrics is presented. The aim of this article is to provide researchers and data curators a comprehensive understanding of existing work, thereby encouraging further experimentation and the development of new approaches focused towards data quality.

Full PDF Version:

swj414.pdf

Revised Version:

Quality Assessment for Linked Open Data: A Survey

Tags:

Reviewed

Decision/Status:

Major Revision

Solicited Reviews:

Click to Expand/Collapse

Review #1

By Aidan Hogan submitted on 19/Feb/2013

Suggestion:
Major Revision

Review Comment:

This paper provides a survey of literature on the topic of Linked Data quality. The paper introduces a survey methodology to locate and narrow down to 21 in-scope peer-reviewed papers that deal with the issue of Linked Data quality. The authors then propose a categorisation scheme of Linked Data quality "dimensions" that summarises and organises those treated by the in-scope literature. The goal here is to consolidate the literature -- which talks about the same or similar issues in different ways -- into a consistent nomenclature and conceptualisation. This broad conceptualisation of Linked Data quality is then discussed in detail, constituting an overview of the selected material. Thereafter, the authors present a comparison of the selected papers/proposals in terms of the quality dimensions covered, the granularity of data covered, the provision of tools, and the mode of evaluation (manual; semi-automatic; automatic).

The paper is clearly on topic for this journal and tackles a very important and non-trivial issue for the Semantic Web/Linked Data community. The survey is, to the best of my knowledge, novel in its breadth and depth. I do, however, have a few concerns about the paper and in particular with a lack of clarity in how it conveys its ideas. I'll continue with the formal criteria for reviewing a survey paper as per the CfP.

(1) Suitability as introductory text, targeted at researchers, PhD students, or practitioners, to get started on the covered topic.

The paper constitutes a comprehensive survey and distils the relevant material into a coherent framework. In general, I like how the authors set out to provide the survey. However, in terms of execution of the paper, even being quite familiar with issues relating to Linked Data quality, I myself found the discussion confusing in many parts and difficult to intuitively grasp. Primarily, there is a lot of conceptual overlap between various dimensions presented and I found it difficult to distinguish what makes a particular criterion unique, or in some cases, how a particular dimension relates to Linked Data quality. I think the conceptualisation needs to be tightened up a lot more and discussed better if it is to serve as an introductory text.

Taking one small example, in Section 3.2, the authors list Completeness, Amount-of-data and Relevancy. I really fail to see three distinct criteria here. Completeness deals with the issue of there being sufficiently complete data within a scope to satisfy a particular task. Relevancy encompasses the issue of the data being important and not containing too much “fluff”. Amount-of-data sits between the two but is not distinguished from either. Having looked through the section a couple of times, I fail to see what makes Amount-of-data a unique criterion here. The paper states “In essence, this dimension conveys the importance of not having too much unnecessary information”. But how is that different from Relevancy? Or Conciseness?

Throughout the paper, I encountered this issue again and again. I'm reading through the criteria but it's not at all clear to me how a specific criterion relates to the others. Quite a few seem redundant. The authors partially tackle this problem by introducing "Relation between dimensions" at the end of the section. However, for me (i) this is too late, in that I want the distinction to be naturally made before/as I read the criteria, not afterwards; (ii) this text is not always sufficient to distinguish why there are different criteria; (iii) the text is often confusing and involves cyclical justifications. As an example of (iii): “For instance, if a dataset is incomplete for a particular purpose, the amount of data is insufficient. However, if the amount of data is too large, it could be that irrelevant data is provided, which affects the relevance dimension.” This does not clarify or justify the dimensions for me at all.

To be more explicit, here's *some* categories I found redundant or unclear as to their distinction:

* Amount-of-data (vs. Completeness/Relevancy)
* Conciseness vs. Relevancy
* Consistency (wrt. Validity)
* Believability vs. Reputation
* Provenance as a distinct quality issue (vs. maybe Completeness)
* Response time (wrt. Performance)
* Representational-conciseness (vs. Conciseness)
* Representational-consistency vs. Interpretability
* Volatility as a quality issue (Is volatility as defined good or bad? Or is this again a Completeness issue? That information about Volatility is provided?)
* Timeliness as a data quality issue (vs. a query engine issue)

I appreciate that Linked Data quality is a complex issue, but that's all the more reason for the authors to make their conceptualisation as simple as possible (and no simpler). There are various pieces of text in the paper that try to distinguish the different criteria, but they are either too vague or too difficult to intuit/understand.

As a side-effect of this seeming redundancy, it seems to me that the authors partially fail in consolidating different proposals by different papers under the one nomenclature: if I look at some of the tables, I see that pretty much identical issues in different papers are listed under different categories where I personally cannot draw any distinction. For example, in Table 4, there are a lot of duplicated issues presented under "Validity-of-documents" and "Consistency" (e.g., use of undefined classes and properties, ill-typed literals, etc.). This problem occurs in a number of places.

The running example is very useful (and important), but I am concerned that, at times, it is too far removed from the question of Linked Data quality and takes too many liberties. For example, sometimes it is not clearly distinguished whether the quality issue in question is a function of the data being published, or whether the quality issue is an artefact of the system itself (cf. Timeliness).

Finally, I feel that parts of Section 4 are really weak and don't contribute anything of significance. In particular, Section 4.5 is quite long without providing much of note. Having read it, I'm still unclear as to what the tools actually *do*, or how that relates to facets of Linked Data quality. This subsection should certainly be shortened (or greatly improved/clarified).

Returning to the CfP text, I could imagine a PhD student or similar, after reading this paper, severely struggling to explain what the conceptualisation of Linked Data quality is to another student, or to distinguish the different dimensions. At least I know that I would struggle to do so having read the paper. Furthermore, I think the student would struggle to say what the individual surveyed papers have done.

(2) How comprehensive and how balanced is the presentation and coverage.

The paper seems comprehensive and pretty well balanced. The authors clearly set out the methodology and scope of the literature survey, which I appreciate. Since this is a survey, there are a couple of other notable works I can suggest as worth a mention (even if not included in the detailed comparison):

* General SW quality discussion:
Vrandecic: "Ontology Evaluation". Phd Thesis. http://www.aifb.kit.edu/images/b/b5/OntologyEvaluation.pdf

* Interlinking/Accuracy
Halpin et al.: "When owl: sameAs Isn't the Same: An Analysis of Identity in Linked Data."" International Semantic Web Conference (1) 2010: 305-320

* Representational consistency/Interpretability
Ding & Finin: "Characterizing the Semantic Web on the Web." ISWC, pages 242–257, November 2006.

* Completeness/Currency
Möller et al.: "Learning from Linked Open Data usage: Patterns & metrics." In WebScience 2010, 2010.

(3) Readability and clarity of the presentation.

The paper is well written in most parts but not well written in other parts. I think it is vital for a paper like this (whose sole contribution is the text itself) to be *very* well written. It certainly starts strongly. However, there are various awkwardly constructed paragraphs and confusing statements throughout the middle of the paper (mostly in Section 3), which hinder its comprehension. Furthermore, I think the formatting of tables could be improved, and I don't like the use of numbered references for a survey like this (I keep asking, which is paper number X again?). I try to provide an *incomplete* list of minor comments along these lines at the end of the review. As well as addressing these, I strongly encourage the authors to go through the paper again and try to sharpen the writing throughout.

Also, I note a few inaccuracies in the paper. I understand that summarising the content of 21 papers is a difficult task, but based purely on the papers I've personally been involved in (for which this was easy for me to recognise), I found that the attributions were muddled in parts. For example, in Table 5, [26] has nothing to do with SPARQL that I know of. Other such errors I noticed are mentioned in the minor comments. It is of course vital that this paper provides proper attribution to the correct references, so I encourage the authors to go through these again.

Finally, the decision to include Figure 4 (a snapshot of I'm not really sure what in the German language) is a very curious one. It should at least be in English but I suggest just remove it altogether.

(4) Importance of the covered material to the broader Semantic Web community.

The material is clearly very important. Though currently flawed, I feel the paper has the obvious potential to be a very important contribution to the community. The authors just need to invest a good bit more time and effort tightening up the conceptualisation and the writing.

High-level suggested improvements:
* Revise and hopefully simply the presented dimensions. At the very least, the dimensions must be clearly and unambiguously distinguished from each other.
* Consolidate similar measures across different papers in the various tables under one dimension.
* Make sure that the examples clearly relate back to *Linked Data quality*, and in particular the quality of underlying *data*.
* Improve the writing of the paper throughout (including, but not limited to, minor comments), particularly as this is a survey paper and the writing is the contribution!
* A possible route to improve the readability of the paper is to instead use *one example* at the *start* of each subsection. For example, at the start of 3.2.2, for Trust dimensions, introduce an example for trust that covers all of the relevant dimensions and how they are distinct from each other. This example will provide, up front, the intuition and distinction behind (i) what the dimensions are, (ii) how they are distinct, (iii) how they are related, including to previously discussed dimensions for other categories. The examples should ideally fit together in such a way to give an impression of progress while reading the paper, and an impression of “completeness” when finished. This is just a suggestion, but I think a really coherent running example will improve the paper greatly.
* Consider adding the references above, even if considered out-of-scope for the main survey.
* Shorten Section 4, and in particular, Section 4.5.
* Double-check and correct attributions throughout.
* "Relations between dimensions" sections should clearly also *distinguish* between dimensions.
* Mostly I suggest tightening up the text. The paper should be shorter, not longer.

==== MINOR COMMENTS *INCOMPLETE* ====

Throughout:
* LOD = Linking Open Data (a W3C project)? Not Linked Open Data?
* You refer to papers as "In [42]" or "the authors of [27]" or so forth. Particularly as this is a survey, I'd prefer if author names are provided up front (or as part of the reference format, if allowed). This way, it will be much easier to track which paper is being talked about and the reader doesn't have to recall, e.g., which paper [42] is again. You do this already in some places, e.g., "Bizer et al. [6] ...".
* Mixed British and American English. On that, perhaps be slightly wary of verbatim quotes from papers without proper quoting mechanisms?
* "eg." Maybe "e.g."?
* I couldn't get my hands on text for [14]. Could you provide a link with the reference?
* "interlinking" Always small case. Maybe an erroneous find/replace?
* A number of entries in the tables don't have references. Were there any criteria by which you decide to extend the list of issues not mentioned in the focus papers?
* When formulas are presented in the tables, even with informal terms, I would prefer to see them typeset as formulas, or at least in math mode. For example "-" is not a minus symbol, it's a very short hyphen; "$-$" is a minus symbol. Maybe use $\frac{}{}$ as well? It seems like it would fit.
* "Flemming et al." should just be "Flemming"

Section 1:
* "Specifically, biological and health care data ..." It sounds like LOD only covers these domains. Perhaps state: "For example, biological ..."?
Section 2:
* "did non" -> "did not"
* Flemming’s thesis was peer-reviewed? Certainly it should be included, but by the same token, I think there are probably a few other theses that should also be mentioned, in particular Denny Vrandecic’s PhD thesis. In any case, just an observation.

Section 3:
* Is the definition of RDF triples/datasets useful? That notation is never used elsewhere? I would suggest to remove it.
* "... web-based information systems design which integrate" Rephrase
* "which are prone of the non ..." Rephrase
* "Thus, data quality problem ..." Rephrase
* "There are three dimensions ... in Table 2" Rephrase
* "in [42], the two types defined" Punctuate/rephrase.
* "(1) {a} propagation algorithm" -> "(1) a propagation algorithm"?
* "a publisher"
* "Verifiability can be measured either by an unbiased third party..." Rephrase
* "whether to accept {the} provided information"
* I'm not sure in Accuracy about the Paris example, or the decision to map inaccurate labelling/classification to Accuracy. That's more a consumer side problem, right? Paris (Texas) is still called Paris.
* Table 4: lots of duplication between issues for [14] and [27]
* "An RDF validator can be used to parse the RDF document ... in accordance with the RDF specification." I feel it's important to note here that RDF is a data-model and has many syntaxes. There is no one specification for which syntax to parse. My concern is that is sounds like there's one syntax (RDF/XML).
* "but inconsistent when taking the OWL2-QL reasoning profile into account." Why OWL 2 QL?
* "B123" Should be A123?
* Figure 3: Where's OWL 2 DL and OWL 2 Full? N3 is not RDF; it's a superset. Why include DAML? What is LD Resource Description? Would prefer to mention JSON-LD and not just generic JSON.
* 'uniqueness' -> `uniqueness'
* "it can be measured as"
* "be able to serve 100 simultaneous users"
* Table 5: First two issues are identical?
* Table 5: [26] has nothing to do with SPARQL?
* Table 5: "When a URI"
* Table 5: Would like to see "no dereferenced forward-links" (locally known triples where the local URI is mentioned in the subject) mentioned in kind.
* Avoiding the use of prolix RDF features seems a strange addition to scalability, though I roughly get the idea of why that connection is made. Perhaps conciseness or representational-conciseness are a better fit?
* "Our flight search engine should provide only that information related to the location rather than returning a chain of other properties." Again, this sounds like a problem with the search engine, not with the data. The data should not be expected to contain precisely all and only relevant data for each user query. It's up to the search engine to filter non-relevant information for a given request.
* "... RDF data in N3 format". Though not strictly incorrect, N3 is a superset of RDF. Better to say "... RDF data in Turtle format."
* Table 6: Nitpick. "detection of the non-standard usage of collections ..." [26] looks at non-standard usage of collections, containers or reification. [27] looks at *any* usage (since these features are discouraged from use by Linked Data guidelines).
* "turtle" -> "Turtle"
* "In order to avoid interoperability issue{s},"
* "Even though there is no central repository of existing vocabularies, ..." What about Linked Open Vocabularies? http://lov.okfn.org/dataset/lov/
* "in the spatial dataset<4>"
* Table 7: the footnotes appear on another page.
* "and it<'>s type"

Section 4:
* "We notice that most of {the} metrics ..."
* Footnote 23: http://swse.deri.org/RDFAlerts/. Hopefully it still works. :) By the way, why is it considered semi-automatic and not automatic?
* Figure 4: Remove or get an English view. I don't think it really serves any purpose here, though I don't speak German so ...
* Section 4.5: I really found it difficult to grasp what the tools (particularly for "Sieve") were doing and the section falls flat. I'd suggest to shorten this section (or improve it a lot).

* This list is very much incomplete! *

Review #2

By Aba-Sah Dadzie submitted on 16/Mar/2013

Suggestion:
Major Revision

Review Comment:

The authors define data quality for LOD weighted by fitness for use in a specified context, and continue to break down quality measures into 5 defined dimensions. The aim of the survey is to compare approaches in the state of the art and quantitative metrics and subjective measures for assessing data quality, to provide explicit measures of quality for LD specifically. The authors find that this is especially important for end users to make effective use of this open, albeit structured and dynamic data.
The survey is based on a small range of recent papers published mostly at conferences and workshops in the Semantic Web community, and a few others in Data/Information/Knowledge Engineering publications.

The survey is quite well timed, as the SW community recognises the need to address challenges by different user types, both technical and not, in "consuming" LD. However, while it covers a good set of the properties of quality that need to be addressed, the breadth of the literature on which it is based is quite narrow. It is a fair argument to make that the field is still relatively new, but this makes it even more important to show where the community can learn from established and verified work in relevant fields. Also, there is no good reason to reinvent the wheel. Starting from existing approaches to measuring data quality, with a focus on other structured data types, would also illustrate what else (or new) this particular approach (LOD) to curating and sharing new data brings to established methods.

What criteria were used to select the initial set of articles, then narrow down to the 21 that form the basis of the survey - this information is finally provided in section 2 - a forward reference is needed in the introduction. Also, building this into fig.1 would make it more informative. One key piece of information is missing in section 2 - the criteria applied for "Extracting data for quantitative and qualitative analysis" - which pruned down from 68 to 21.
Further, the authors list 4 reasons for carrying out a review such as that they report - how are these derived? If from existing literature this must be properly cited. If the authors' own work this should be stated, with evidence supporting their validity.

The paper is quite detailed - as would be expected in a survey. It could however be shortened to some extent - section 3 especially makes for tedious reading - the background work could be more concisely presented, and repetition eliminated, especially between the definitions and the explanatory text. S.4.5 could also be summarised.

In the conclusions the authors state that their survey classifies tools in relation to, among others, "level of user knowledge" - this is not a header in Table 9 - is this derived? Further, is this knowledge wrt to the tool or the data? Also, why is the implementation of a tool a requirement for a quality methodology proposed in a publication? As their own survey shows, a lot of the measures are not directly measurable, but to a large extent derived using qualitative means. Finally, I would argue that any tool or approach that claims to measure a large and not very closely related set of properties (for data quality) is less likely to be reliable, or at best will give a very high level view - that the approaches were mostly found to be quite specific should be seen as a positive.

DETAILED REVIEW

The inclusion criterion: "Studies that did non propose any methodology or framework for the assessment of quality in LOD" - this is a very narrow definition - LOD is still fairly new - how was this applied? Surely (see above), data quality measures for other structured data types would be useful here - in fact, the community should learn from what are still accepted, tried and tested approaches. This is reinforced by the fact that only 5 of the 21 are journal papers, and all from a single SW journal, and a fair number are workshop papers. Without dismissing the value of other publication types, the scope for reporting detailed analysis and research in full conference papers and journals is much higher - for this type of survey this is what is needed. This makes the findings difficult to validate and generalise - surveys are meant to be reference points in a selected topic, and by definition should throw a much wider net (at least to start with) than is done here.
The second search query is confusing - parentheses must be used to clarify whether the AND applies only to the preceding term or not - the relation between the two terms makes it impossible to guess. Why did the authors not search in article keyword lists - these are probably the most important set of metadata for classifying publications.

Given that the definition of RDF triples, graphs and datasets should not change (with context), such a detailed definition in addition to the citations is probably superfluous.

There are a few hanging statements through the paper - e.g., at the start the authors specifically refer to LD covering biology and health care, but do not give any reason for doing so. A bit further down they state:
"For developing a medical application, on the other hand, the quality of DBpedia is probably completely insufficient." - WHY? there are any number of reasons, the reader shouldn't be left to guess. Also, are the two examples by any chance related?

p.3 "Additionally, the prior review only focused on the data quality dimensions identified in the constituent approaches. " - which - reference 3 or 34?

"When a user is looking for flights between any two cities, only relevant information i.e. start and end times, duration and cost per person should be provided. If a lot of irrelevant data is included in the spatial data, e.g. post offices, trees etc. (present in LinkedGeoData), query performance can decrease." More relevant examples of superfluous data would illustrate the problem better.

The argument about "too large" data being an indication of noise is debatable - relatively large amounts of noise may be found in small datasets, and very little noise in very large datasets. In the same vein, incompleteness may not simply be attributable to insufficient amount - again, a very large amount of data may still be incomplete, simply because some attributes are not properly filled, even with a low %age of noise.

I don't see why the first point under verifiability is subjective - the publishers and authors are fixed entities.
Point 3 - is the third party human or machine? If the latter should it not be objective?
The authors mention "sources with low believability or reputation" - is this meant to be negative, as in the sources have been proved to be unreliable, or does this include sources that simply have not yet established trust, e.g., because they are new and/or don't provide enough information with which to make that judgement?

The believability measure is classified as an objective type, however the authors write "In Linked Data, believability can be subjectively measured by analyzing the provenance information of the dataset."

The authors state that "Objectivity can not be measured qualitatively but indirectly by checking the authenticity of ..." - this is confusing... saying that the measure is indirect is more a property of a qualitative than a quantitative measure. Also, the three sub-dimensions are all set to 'S'. The section concludes with "Measuring the bias in qualitative data is far more challenging. " - what is meant by "qualitative data" here - this is not normally an adjective used to describe "data" as being used in this paper (i.e., LD).

For "consistency" - "no redefinition of existing properties" & "ambiguous annotation" - why are these 'S' - I would assume they could be checked using automated methods so should be 'O'? Further, both are directly related to "ontology hijacking", which is set as 'O'.
The example given has different flight numbers A123 & B123.

Based on the definitions and examples given for conciseness, I am confused by the definitions of intensional and extensional conciseness in the ff.: "While integrating data from two different datasets, if both use the same schema or vocabulary to represent the data, then the intensional conciseness is high. However, if the integration leads to the duplication of values, that is the same information is stored in different ways, this leads to extensional conciseness. " - should this not rather decrease conciseness?

For availability - the first two points in Table 5 have the same description

Volatility repeats the same sub-dimension in Table 7. Also, the definition for volatility needs to be properly referenced.

Table 8 is described as being split into three parts in the text - however the split is not shown.

Fig. 3 could do with some annotation and a more complete description in the text - I am not completely sure how to interpret it.

While I can see why the authors compare the three tools in S4.5 I'm not sure that that level of detail is needed - a summary of the amount of effort required by the end user and any drawbacks, and the pointer to the description in the source publication should be enough? Also, what were the criteria used to select them?

OTHER POINTS

p.11 - "It can be observed here that on the one hand the authors in [5] provide a formal definition whereas the author in [14] describes" - 5, as for 14, has a single author

Citations & References

Facts and claims must ALWAYS be supported with evidence, e.g., the size of the LOD is constantly changing - this needs both a citation and a timestamp.

Order (multiple) in-text citations numerically - especially important for such a long reference list.

[33] KITCHENHAM, B. Procedures for performing systematic reviews. - no date, nor publication venue

Acronyms in citations must use correct case, e.g., Rdf -> RDF

[5] is one of the most highly cited references in the paper, mostly on its own - it is however an (unpublished) PhD thesis. While I do not underestimate the value and breadth of doctoral research I'm not completely convinced that such weight should be assigned to this type of publication in a (state of the art) survey - measuring by formal weights assigned to publication types. Further, a survey needs to provide a much broader view on the literature - see also previous comments about the narrowness of the search criteria.
Another very heavily cited publication is [14] - this is a Master's thesis.

End of para 1, section 3.2.4.2 - the citations should be repeated rather than using former and latter - they are mentioned too far away for the reader to recognise easily what is being referred to.

LODGRefine should be referenced when first mentioned in S4.5

Presentation & Language

"Relations between dimensions." for each of the quality dimension sections is not obviously a summary of the relations for the sub-topics - this needs to be formatted as another header - it appears to fall under the last sub-topic addressed in each case - especially hidden within the dense text. Actually, the labeling seems to be done in some cases...

Is there a reason "interlinking" (S3.2.3.4.) only uses lowercase - in Table 4 and the other headings in this section?

Convention places table captions at the top.

"fitness of use" - I believe the term is "fitness FOR use"; "vice-a-versa" -> "vice versa"

For consistency, all acronyms or abbreviations should be expanded at first use - e.g., URI is not less well known than RDF in the SW community - defining the former and not the latter is inconsistent.

Quite a few typos & grammatical errors in the text will be caught by an auto-check and proof-read

Review #3

By Peter Haase submitted on 16/Mar/2013

Suggestion:
Major Revision

Review Comment:

The article provides a survey of quality assessment methodologies for Linked Data / Linked Open Data. Data quality is a very important and often underestimated problem in for the use and adoption of Linked Data. As such, the article addresses a very important topic.
The article is the result of a very extensive, detailed and comprehensive study, performed as a joint activity between two research groups. It has the potential to become a seminal reference for data quality in this community.

However, in the current form there are a number of problems.

The main problem is the observation that most of the content actually does not concern the comparison of the quality assessment methodologies, but rather the discussion of quality dimensions. In fact, only around seven of the 33 pages deal with the comparison of the methodologies, and from these seven pages more than half stay on a meta-level, discussing again the dimensions and the comparison criteria rather than the actual evaluation methodologies. There is a lot of detail provided about the process how the approaches have been selected. The actual presentation and comparison of the methodologies is limited to three tools, treated on two of the 33 pages. The information for the other approaches is on such an aggregate level that the article can only serve as a reference to the primary publications.
Further, the comparison matrix is so sparsely populated that one might question either the quality dimensions or the selection of approaches. Specifically, more than half of the selected approaches cover only one of the dimensions, of those most cover only provenance, where I would actually question whether this is a quality dimension (as discussed below).

The pre-selection of considered methodologies/approaches appears to be very “mechanic”, based on defined set of keywords issued against search engines.
Why did you not consult experts on the topic for identifying relevant work?

The discussion of the quality dimensions may be a contribution on its own. However, most of the dimensions are not new and well known from standard literature.
The presentation differs from previous presentations in that attempts to “formalize” and define terms and quality measures, but this attempt somewhat missed its target. Many of the quality measure definitions are too vague to be called definitions. E.g., in definitions the terms accuracy, correctness and precision are used almost interchangeably. It is hard to come up with proper definitions, but I would actually prefer a less formal presentation than improper definitions.
The important terms of a data set and data source remain rather vague, their relationship unclear (Section 3.1), while the notion of an RDF triple and graph is very formally defined, although its definition is not used and needed. Later on, the notion of a (RDF-)document is used (in the definition of validity), but this term is not defined.

There are a few new quality dimensions, but those are partially problematic. E.g., one of the new dimensions is provenance. It is not clear why provenance would be a quality dimension on its own. It is clear that provenance can be used to assess quality dimensions like reliability, but that does not make it a quality dimension.

There are some dimensions that are specific to Linked Data, but these in fact are the most problematic ones. E.g. validity conflates many different problems from syntactic errors to schema conformance. The definition of interlinking refers to the degree that entities referring to the same concept link to each other, but completely disregards the data sets (i.e. whether the links are within or across data sets). The definition of consistency refers to the data set a knowledge base, but the relationship between RDF data sets and logical inference based e.g. on some ontology language is not properly discussed. The quality dimensions of accessibility again conflate quality of a data set vs. quality of an endpoint, while the terms have not been properly distinguished.

In summary, there is a lot of useful and substantial material in the article. For publication, it requires a slight restructuring, a better motivation of the contributions (quality dimensions vs. actual comparison) and some improvements and clarifications in the definitions.

Minor problems:
You state the study/review was conducted by “two reviewers”. Why reviewers, and what is the relationship with the authors? (I assume they are the two first authors, but perhaps this can be made explicit.)

You inconsistently talk about quality for Linked Data vs. Linked Open Data. In the context of quality, this a very important distinction, which is not really considered in the metrics.

You inconsistently use the phrases “fitness of use”, “fitness for use”, “fitness to use”

Figure 3 is unclear: What is the semantics of the edges in the graph?

I would suggest introducing the use case scenario (from Section 2.1) earlier in the article, it does not logically fit into the middle of the presentation of the quality dimensions.

There are a number of approaches that come to my mind that I would have considered relevant, especially regarding the quality dimensions that consider the data sets as a knowledge base):
Denny Vrandecic: Ontology Evaluation, PhD Thesis, KIT.
Holger Lewen: Facilitating Ontology Reuse Using User-Based Ontology Evaluation, PhD Thesis, KIT

Log in or register to post comments
20109 reads

Main menu

Editorial Board

Syndicate

Quality Assessment Methodologies for Linked Open Data

Tracking #: 414-1533

Reviewed Articles

Authors & Reviewers

Links

Recent blog posts

Accepted Articles

Search form

Main menu

Login

Editorial Board

Syndicate

Quality Assessment Methodologies for Linked Open Data

Tracking #: 414-1533

Reviewed Articles

Authors & Reviewers

Links

Recent blog posts

Accepted Articles