Review Comment:
This paper provides a survey of literature on the topic of Linked Data quality. The paper introduces a survey methodology to locate and narrow down to 21 in-scope peer-reviewed papers that deal with the issue of Linked Data quality. The authors then propose a categorisation scheme of Linked Data quality "dimensions" that summarises and organises those treated by the in-scope literature. The goal here is to consolidate the literature -- which talks about the same or similar issues in different ways -- into a consistent nomenclature and conceptualisation. This broad conceptualisation of Linked Data quality is then discussed in detail, constituting an overview of the selected material. Thereafter, the authors present a comparison of the selected papers/proposals in terms of the quality dimensions covered, the granularity of data covered, the provision of tools, and the mode of evaluation (manual; semi-automatic; automatic).
The paper is clearly on topic for this journal and tackles a very important and non-trivial issue for the Semantic Web/Linked Data community. The survey is, to the best of my knowledge, novel in its breadth and depth. I do, however, have a few concerns about the paper and in particular with a lack of clarity in how it conveys its ideas. I'll continue with the formal criteria for reviewing a survey paper as per the CfP.
(1) Suitability as introductory text, targeted at researchers, PhD students, or practitioners, to get started on the covered topic.
The paper constitutes a comprehensive survey and distils the relevant material into a coherent framework. In general, I like how the authors set out to provide the survey. However, in terms of the execution of the paper, even being quite familiar with issues relating to Linked Data quality, I found the discussion confusing in many parts and difficult to grasp intuitively. Primarily, there is a lot of conceptual overlap between the various dimensions presented, and I found it difficult to distinguish what makes a particular criterion unique or, in some cases, how a particular dimension relates to Linked Data quality at all. I think the conceptualisation needs to be tightened up a lot more and discussed better if it is to serve as an introductory text.
Taking one small example, in Section 3.2, the authors list Completeness, Amount-of-data and Relevancy. I really fail to see three distinct criteria here. Completeness deals with the issue of there being sufficiently complete data within a scope to satisfy a particular task. Relevancy encompasses the issue of the data being important and not containing too much “fluff”. Amount-of-data sits between the two but is not distinguished from either. Having looked through the section a couple of times, I fail to see what makes Amount-of-data a unique criterion here. The paper states “In essence, this dimension conveys the importance of not having too much unnecessary information”. But how is that different from Relevancy? Or Conciseness?
Throughout the paper, I encountered this issue again and again. I'm reading through the criteria, but it's not at all clear to me how a specific criterion relates to the others. Quite a few seem redundant. The authors partially tackle this problem by introducing "Relation between dimensions" at the end of the section. However, for me (i) this comes too late, in that I want the distinction to be made naturally before/as I read the criteria, not afterwards; (ii) this text is not always sufficient to distinguish why there are different criteria; (iii) the text is often confusing and involves circular justifications. As an example of (iii): “For instance, if a dataset is incomplete for a particular purpose, the amount of data is insufficient. However, if the amount of data is too large, it could be that irrelevant data is provided, which affects the relevance dimension.” This does not clarify or justify the dimensions for me at all.
To be more explicit, here are *some* categories I found redundant or unclear as to their distinction:
* Amount-of-data (vs. Completeness/Relevancy)
* Conciseness vs. Relevancy
* Consistency (wrt. Validity)
* Believability vs. Reputation
* Provenance as a distinct quality issue (vs. maybe Completeness)
* Response time (wrt. Performance)
* Representational-conciseness (vs. Conciseness)
* Representational-consistency vs. Interpretability
* Volatility as a quality issue (Is volatility, as defined, good or bad? Or is this again a Completeness issue, i.e., that information about volatility is provided?)
* Timeliness as a data quality issue (vs. a query engine issue)
I appreciate that Linked Data quality is a complex issue, but that's all the more reason for the authors to make their conceptualisation as simple as possible (and no simpler). There are various pieces of text in the paper that try to distinguish the different criteria, but they are either too vague or too difficult to intuit/understand.
As a side-effect of this seeming redundancy, the authors partially fail to consolidate the proposals of different papers under one nomenclature: looking at some of the tables, I see pretty much identical issues from different papers listed under different categories between which I personally cannot draw any distinction. For example, in Table 4, there are a lot of duplicated issues presented under "Validity-of-documents" and "Consistency" (e.g., use of undefined classes and properties, ill-typed literals, etc.). This problem occurs in a number of places.
The running example is very useful (and important), but I am concerned that, at times, it is too far removed from the question of Linked Data quality and takes too many liberties. For example, sometimes it is not clearly distinguished whether the quality issue in question is a function of the data being published or an artefact of the system itself (cf. Timeliness).
Finally, I feel that parts of Section 4 are really weak and don't contribute anything of significance. In particular, Section 4.5 is quite long without providing much of note. Having read it, I'm still unclear as to what the tools actually *do*, or how that relates to facets of Linked Data quality. This subsection should certainly be shortened (or greatly improved/clarified).
Returning to the CfP text, I could imagine a PhD student or similar, after reading this paper, severely struggling to explain what the conceptualisation of Linked Data quality is to another student, or to distinguish the different dimensions. At least I know that I would struggle to do so having read the paper. Furthermore, I think the student would struggle to say what the individual surveyed papers have done.
(2) How comprehensive and how balanced is the presentation and coverage.
The paper seems comprehensive and pretty well balanced. The authors clearly set out the methodology and scope of the literature survey, which I appreciate. Since this is a survey, there are a few other notable works I can suggest as worth a mention (even if not included in the detailed comparison):
* General SW quality discussion:
Vrandecic: "Ontology Evaluation". PhD thesis. http://www.aifb.kit.edu/images/b/b5/OntologyEvaluation.pdf
* Interlinking/Accuracy
Halpin et al.: "When owl:sameAs Isn't the Same: An Analysis of Identity in Linked Data". International Semantic Web Conference (1) 2010: 305-320.
* Representational consistency/Interpretability
Ding & Finin: "Characterizing the Semantic Web on the Web." ISWC, pages 242–257, November 2006.
* Completeness/Currency
Möller et al.: "Learning from Linked Open Data usage: Patterns & metrics." In WebScience 2010, 2010.
(3) Readability and clarity of the presentation.
The paper is well written in most parts, but poorly written in others. I think it is vital for a paper like this (whose sole contribution is the text itself) to be *very* well written. It certainly starts strongly. However, there are various awkwardly constructed paragraphs and confusing statements throughout the middle of the paper (mostly in Section 3), which hinder its comprehension. Furthermore, I think the formatting of the tables could be improved, and I don't like the use of numbered references for a survey like this (I keep asking myself: which paper is number X again?). I provide an *incomplete* list of minor comments along these lines at the end of the review. As well as addressing these, I strongly encourage the authors to go through the paper again and sharpen the writing throughout.
Also, I note a few inaccuracies in the paper. I understand that summarising the content of 21 papers is a difficult task, but based purely on the papers I've personally been involved in (for which this was easy for me to recognise), I found that the attributions were muddled in parts. For example, in Table 5, [26] has nothing to do with SPARQL that I know of. Other such errors I noticed are mentioned in the minor comments. It is of course vital that this paper provides proper attribution to the correct references, so I encourage the authors to go through these again.
Finally, the decision to include Figure 4 (a snapshot of I'm not really sure what, in German) is a very curious one. It should at least be in English, but I suggest removing it altogether.
(4) Importance of the covered material to the broader Semantic Web community.
The material is clearly very important. Though currently flawed, I feel the paper has the obvious potential to be a very important contribution to the community. The authors just need to invest a good bit more time and effort tightening up the conceptualisation and the writing.
High-level suggested improvements:
* Revise and hopefully simplify the presented dimensions. At the very least, the dimensions must be clearly and unambiguously distinguished from each other.
* Consolidate similar measures across different papers in the various tables under one dimension.
* Make sure that the examples clearly relate back to *Linked Data quality*, and in particular the quality of underlying *data*.
* Improve the writing of the paper throughout (including, but not limited to, minor comments), particularly as this is a survey paper and the writing is the contribution!
* A possible route to improving the readability of the paper is to instead use *one example* at the *start* of each subsection. For example, at the start of 3.2.2, for Trust dimensions, introduce an example for trust that covers all of the relevant dimensions and how they are distinct from each other. This example will provide, up front, the intuition and distinction behind (i) what the dimensions are, (ii) how they are distinct, and (iii) how they are related, including to previously discussed dimensions from other categories. The examples should ideally fit together in such a way as to give an impression of progress while reading the paper, and an impression of “completeness” when finished. This is just a suggestion, but I think a really coherent running example would improve the paper greatly.
* Consider adding the references above, even if considered out-of-scope for the main survey.
* Shorten Section 4, and in particular, Section 4.5.
* Double-check and correct attributions throughout.
* "Relations between dimensions" sections should clearly also *distinguish* between dimensions.
* Mostly I suggest tightening up the text. The paper should be shorter, not longer.
==== MINOR COMMENTS *INCOMPLETE* ====
Throughout:
* LOD = Linking Open Data (a W3C project)? Not Linked Open Data?
* You refer to papers as "In [42]" or "the authors of [27]" or so forth. Particularly as this is a survey, I'd prefer if author names are provided up front (or as part of the reference format, if allowed). This way, it will be much easier to track which paper is being talked about and the reader doesn't have to recall, e.g., which paper [42] is again. You do this already in some places, e.g., "Bizer et al. [6] ...".
* Mixed British and American English. On that, perhaps be slightly wary of verbatim quotes from papers without proper quoting mechanisms?
* "eg." Maybe "e.g."?
* I couldn't get my hands on text for [14]. Could you provide a link with the reference?
* "interlinking" Always small case. Maybe an erroneous find/replace?
* A number of entries in the tables don't have references. Were there any criteria by which you decided to extend the list with issues not mentioned in the focus papers?
* When formulas are presented in the tables, even ones with informal terms, I would prefer to see them typeset as formulas, or at least in math mode. For example, "-" is not a minus symbol; it's a very short hyphen, whereas "$-$" is a minus symbol. Maybe use $\frac{}{}$ as well? It seems like it would fit (see the sketch after this group of comments).
* "Flemming et al." should just be "Flemming"
Section 1:
* "Specifically, biological and health care data ..." It sounds like LOD only covers these domains. Perhaps state: "For example, biological ..."?
Section 2:
* "did non" -> "did not"
* Flemming’s thesis was peer-reviewed? Certainly it should be included, but by the same token, I think there are probably a few other theses that should also be mentioned, in particular Denny Vrandecic’s PhD thesis. In any case, just an observation.
Section 3:
* Is the definition of RDF triples/datasets useful? The notation never seems to be used elsewhere. I would suggest removing it.
* "... web-based information systems design which integrate" Rephrase
* "which are prone of the non ..." Rephrase
* "Thus, data quality problem ..." Rephrase
* "There are three dimensions ... in Table 2" Rephrase
* "in [42], the two types defined" Punctuate/rephrase.
* "(1) {a} propagation algorithm" -> "(1) a propagation algorithm"?
* "a publisher"
* "Verifiability can be measured either by an unbiased third party..." Rephrase
* "whether to accept {the} provided information"
* I'm not sure about the Paris example in Accuracy, or the decision to map inaccurate labelling/classification to Accuracy. That's more a consumer-side problem, right? Paris (Texas) is still called Paris.
* Table 4: lots of duplication between issues for [14] and [27]
* "An RDF validator can be used to parse the RDF document ... in accordance with the RDF specification." I feel it's important to note here that RDF is a data-model and has many syntaxes. There is no one specification for which syntax to parse. My concern is that is sounds like there's one syntax (RDF/XML).
* "but inconsistent when taking the OWL2-QL reasoning profile into account." Why OWL 2 QL?
* "B123" Should be A123?
* Figure 3: Where's OWL 2 DL and OWL 2 Full? N3 is not RDF; it's a superset. Why include DAML? What is LD Resource Description? Would prefer to mention JSON-LD and not just generic JSON.
* 'uniqueness' -> `uniqueness'
* "it can be measured as"
* "be able to serve 100 simultaneous users"
* Table 5: First two issues are identical?
* Table 5: [26] has nothing to do with SPARQL?
* Table 5: "When a URI"
* Table 5: Would like to see "no dereferenced forward-links" (locally known triples where the local URI is mentioned in the subject) mentioned in kind (see the second sketch at the end of these Section 3 comments).
* Avoiding the use of prolix RDF features seems a strange addition to scalability, though I roughly get the idea of why that connection is made. Perhaps conciseness or representational-conciseness are a better fit?
* "Our flight search engine should provide only that information related to the location rather than returning a chain of other properties." Again, this sounds like a problem with the search engine, not with the data. The data should not be expected to contain precisely all and only relevant data for each user query. It's up to the search engine to filter non-relevant information for a given request.
* "... RDF data in N3 format". Though not strictly incorrect, N3 is a superset of RDF. Better to say "... RDF data in Turtle format."
* Table 6: Nitpick. "detection of the non-standard usage of collections ..." [26] looks at non-standard usage of collections, containers or reification. [27] looks at *any* usage (since these features are discouraged from use by Linked Data guidelines).
* "turtle" -> "Turtle"
* "In order to avoid interoperability issue{s},"
* "Even though there is no central repository of existing vocabularies, ..." What about Linked Open Vocabularies? http://lov.okfn.org/dataset/lov/
* "in the spatial dataset<4>"
* Table 7: the footnotes appear on another page.
* "and it<'>s type"
Section 4:
* "We notice that most of {the} metrics ..."
* Footnote 23: http://swse.deri.org/RDFAlerts/. Hopefully it still works. :) By the way, why is it considered semi-automatic and not automatic?
* Figure 4: Remove or get an English view. I don't think it really serves any purpose here, though I don't speak German so ...
* Section 4.5: I really found it difficult to grasp what the tools (particularly for "Sieve") were doing and the section falls flat. I'd suggest to shorten this section (or improve it a lot).
* This list is very much incomplete! *