Review Comment:
This paper presents an analysis of the quality of 11 major datasets that are related, but not restricted, to the cultural heritage domain (e.g. DBpedia, Wikidata, Wikipedia). The analysis is based mainly on the manual inspection of 100 entities from five different categories (agents, events, dates, places, and concepts), with a total of 859 IRIs examined across these 11 datasets. The author analyses the traversal maps of these instances and provides some statistics on their centrality and connectivity, aggregated to the dataset level.
(Originality and significance of the results):
The author mentions in the introduction that the following research question is pursued in this paper: "The author examines the connectivity of the LOD instances through lookups. In other words, what level of information could we find, when a local dataset links to them, and what kind of data aggregation and integration could be possible by following the links?".
Despite the large amount of detail provided, I don't think the paper manages to answer this research question in full, especially the part regarding "what level of information we can find when linking?". The author studies the connectivity of the datasets by analysing the group of instances from each of the five categories, but does not show the benefits of linking to any of these datasets. This could be done, for instance, by analysing the quantity and quality of the (new) information gained by a dataset A when it links to a dataset B. Such a study can of course take into consideration the connectivity of datasets studied in this work, since dataset A can also follow, and benefit from, all the links originating from dataset B to other datasets. I believe that this type of analysis could have been the added value of this paper. Merely studying the connectivity of 11 datasets based on a sample of 100 entities has little impact, as it only confirms observations made in previous (larger-scale) studies, which the author states in the conclusion. In addition, even though the author explicitly states that they want to bypass the complicated discussion regarding the semantics of the relations used for linking (a choice I completely agree with), the study still has to make a minimal distinction between the different types of relations used for linking, as there is an important difference between two IRIs linked by an rdfs:seeAlso statement and two IRIs linked by an owl:sameAs statement.
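To make this suggestion concrete, here is a minimal sketch of the kind of measurement I have in mind. It is my own simplified illustration, not a method taken from the paper: the IRIs, the `ex:` properties, and the function names are all invented, and "gain" is simply the number of statements reachable in one hop that the local description does not already contain, reported separately for owl:sameAs and rdfs:seeAlso.

```python
OWL_SAMEAS = "http://www.w3.org/2002/07/owl#sameAs"
RDFS_SEEALSO = "http://www.w3.org/2000/01/rdf-schema#seeAlso"

# Toy (subject, predicate, object) triples; every IRI and value is hypothetical.
TRIPLES = [
    ("a:Rembrandt", OWL_SAMEAS, "b:Rembrandt"),
    ("a:Rembrandt", RDFS_SEEALSO, "c:Rembrandt_page"),
    ("a:Rembrandt", "ex:name", "Rembrandt"),
    ("b:Rembrandt", "ex:name", "Rembrandt"),
    ("b:Rembrandt", "ex:birthYear", "1606"),
    ("b:Rembrandt", "ex:birthPlace", "Leiden"),
    ("c:Rembrandt_page", "ex:abstract", "Dutch painter"),
]


def description(triples, subject, link_predicates):
    """Return the (predicate, object) pairs about `subject`, ignoring link statements."""
    return {
        (p, o)
        for s, p, o in triples
        if s == subject and p not in link_predicates
    }


def gain_by_link_type(triples, subject, link_predicates=(OWL_SAMEAS, RDFS_SEEALSO)):
    """Count, per link type, the one-hop statements missing from the local description."""
    local = description(triples, subject, link_predicates)
    gain = {}
    for link in link_predicates:
        targets = [o for s, p, o in triples if s == subject and p == link]
        remote = set()
        for target in targets:
            remote |= description(triples, target, link_predicates)
        # Only statements not already known locally count as new information.
        gain[link] = len(remote - local)
    return gain


gain = gain_by_link_type(TRIPLES, "a:Rembrandt")
for link, count in gain.items():
    print(link, count)
```

Reporting the two link types separately matters because statements reached through owl:sameAs can be merged into the description of the same entity, whereas statements reached through rdfs:seeAlso describe a related resource and cannot be merged in the same way.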
(Quality of writing):
Apart from the limited contributions of the work, I think that the paper suffers from major presentation issues, chiefly the amount of repetitive and redundant detail in Section 4, which massively impacts the quality of the paper. Here are some of these presentation issues:
1) Repetitive experiments: sections 4.1 to 4.6 (which represent more than a quarter of the paper when including their figures) are basically the same analysis, just conducted on different instances
2) Redundant details: in my opinion, around half of the details in Section 4 would be better placed in a technical report that the author can refer to in the paper
3) Frequent presence of numbers and percentages within the text
4) Quality of the figures and the lack of self-descriptive captions
5) Distance of 5 to 6 pages between the text and the location of the figures referred to in the text
6) Constant confusion between the RDF/XML syntax and the RDF data model
I think that the problem of studying the quality of datasets that is addressed in this submission is of great importance, not only for the cultural heritage domain, but for the Web of Data in general. However, I think that the analyses conducted in this work might have limited impact, despite the considerable manual effort invested by the author, which I think could be worth much more. I strongly recommend that the author aim, in a resubmission, to provide more insights for answering the research question stated in the introduction, by going beyond the connectivity and network analysis conducted on this sample of 100 entities. As I previously mentioned, the author can try to capitalise on the manual effort done in this work to analyse the impact and importance of the existing links between the major datasets in answering certain user questions in the cultural heritage domain (i.e. SPARQL queries). This type of research is mentioned by the author as one of the challenges in the conclusion of the paper, but is not investigated in the paper. In addition, since the author has pointed out the absence of a gold standard, I suggest that they aim to publish the 859 instances with their 10,474 links as part of a new gold standard for evaluating the quality of datasets in the Web of Data. I suggest making the type of each link clear to the user (whether it is an owl:sameAs or an rdfs:seeAlso link). It is much appreciated that the author has already made their scripts and data available on Zenodo, but I think that in their current state it is somewhat difficult for another user to make use of these files.
---
More specific comments can be found below.
- Section 1 provides a nice introduction for the paper, but I think it can be improved by better highlighting the research question pursued in this work and the list of contributions.
- Section 2 covers a good portion of the related work. I suggest mentioning the following paper by Guéret et al., where the authors investigated the use of a metric called "description richness" for assessing the quality of identity links. This metric measures how much is added to the description of a resource through the use of sameAs edges (see section 3.5 of that paper). This aspect is mentioned in the research question of this work, but not investigated (i.e. what level of information we can find when linking):
Christophe Guéret, Paul Groth, Claus Stadler, and Jens Lehmann. "Assessing linked data mappings using network measures." In Extended semantic web conference, pp. 87-102, 2012.
- Regarding Section 3, I don't understand why the author makes the RDF/XML syntax a minimum requirement for choosing a dataset in the study. All the other serialisation formats (e.g. Turtle, N-Triples, N-Quads) are valid RDF data as well, and most RDF libraries are able to process RDF data independently of its syntax, and easily convert from one syntax to another.
- Section 4 is very difficult to read, and repetitive.
* This section contains too many numbers, which confuses the reader since most of the time these values cannot easily be found in the tables or figures. In certain cases, it would have been preferable to use an 11 x 11 matrix showing the number of connections between each pair of datasets, including self-links, and to highlight only the interesting numbers in the text.
* The captions of the figures are mostly not self-descriptive. For example, the caption of Figure 17 is "The percentage of 4 properties against rdf:resource". For Figure 3, what is a node/edge? Does each node represent all the identifiers coming from a certain dataset? If so, suppose that only one of the 100 studied entities has its Wikidata IRI linked to its YAGO IRI, while no other links exist between Wikidata and YAGO for the remaining 99 entities: does this mean that there is a path between the Wikidata and YAGO nodes in this figure?
* All figures referred to from Section 4 are located at least 5 pages later in the paper. For example, on page 9 the author refers to Figure 4, which is located on page 15. The same applies to most of the figures in the paper, which makes reading this section more difficult.
* I have difficulty understanding the study conducted in section 4.8. I think there is a major confusion here between the syntax (e.g. RDF/XML, which uses rdf:about and rdf:resource) and the RDF data model.
- For Section 5, I strongly suggest adding a title for each point of discussion addressed in the conclusion section. Long texts are hard to read and the message can easily get lost.
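As an illustration of the dataset-by-dataset matrix suggested for Section 4, the following is a minimal sketch under stated assumptions: the links are available as (source dataset, target dataset) pairs, and the three datasets and the link counts below are toy values chosen by me, not figures from the paper. The function names are likewise hypothetical.

```python
from collections import Counter


def link_matrix(links, datasets):
    """Build a square matrix where cell [i][j] counts links from datasets[i] to datasets[j]."""
    counts = Counter(links)
    # The diagonal holds self-links (links that stay within one dataset).
    return [[counts[(src, dst)] for dst in datasets] for src in datasets]


def format_matrix(matrix, datasets):
    """Render the matrix as aligned plain text, with sources as rows and targets as columns."""
    width = max(len(name) for name in datasets) + 2
    header = " " * width + "".join(name.rjust(width) for name in datasets)
    rows = [
        src.ljust(width) + "".join(str(n).rjust(width) for n in row)
        for src, row in zip(datasets, matrix)
    ]
    return "\n".join([header] + rows)


# Toy input: one (source, target) pair per observed link.
DATASETS = ["DBpedia", "Wikidata", "YAGO"]
LINKS = [
    ("DBpedia", "Wikidata"),
    ("DBpedia", "YAGO"),
    ("DBpedia", "DBpedia"),
    ("Wikidata", "DBpedia"),
    ("DBpedia", "YAGO"),
]

matrix = link_matrix(LINKS, DATASETS)
print(format_matrix(matrix, DATASETS))
```

With the real data, one such table per link type would replace most of the numbers currently scattered through the running text, and would also make the incoming and outgoing totals directly readable as column and row sums.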
Some minor comments and typos:
- Page 3: 11dimensions -> 11[empty space]dimensions
- Page 4: A small introduction for section 3 would be appreciated
- Page 4: s/he follows -> they follow
- I suggest using a simpler syntax than RDF/XML for the examples (e.g. Turtle or N-Triples)
- Page 4: it is highly important that the end users need to obtain -> can obtain?
- It is more common to spell out numbers smaller than 10. Example Page 5: the importance of the 4 core questions -> of the four core questions
- Page 8: It is not clear how the following statement confirms that DBpedia and YAGO are tightly connected: "Regarding the interactions of the data sources, DBpedia holds 8035 incoming and 5832 outgoing links, while YAGO does 273 and 2713 links respectively, confirming that the two data sources are tightly connected".
- Page 9: rather than their quantities of the links -> rather than the amount of links
- Page 9: [32] stress... -> Idrissou et al. [32] stress... (same applies for several similar cases)
- Page 21: rdf:reseource -> rdf:resource
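To clarify the distinction between syntax and data model raised in the comments on Sections 3 and 4.8, here is a small sketch showing that rdf:about and rdf:resource are attributes of the RDF/XML serialisation only. It uses Python's standard XML parser and handles just the one pattern shown (an rdf:Description with rdf:about whose property elements carry rdf:resource); it is not a conforming RDF/XML parser, and the example.org IRIs are invented.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# A hypothetical RDF/XML fragment with a single owl:sameAs statement.
RDF_XML = """
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:owl="http://www.w3.org/2002/07/owl#">
  <rdf:Description rdf:about="http://example.org/a/Rembrandt">
    <owl:sameAs rdf:resource="http://example.org/b/Rembrandt"/>
  </rdf:Description>
</rdf:RDF>
"""


def triples_from_rdfxml(text):
    """Extract (subject, predicate, object) IRIs from the simple pattern described above."""
    root = ET.fromstring(text.strip())
    triples = []
    for node in root.findall(f"{{{RDF_NS}}}Description"):
        subject = node.get(f"{{{RDF_NS}}}about")
        for prop in node:
            obj = prop.get(f"{{{RDF_NS}}}resource")
            if subject is None or obj is None:
                continue  # Literals, blank nodes, etc. are outside this sketch.
            # ElementTree tags look like "{namespace}local"; join them into one IRI.
            predicate = prop.tag[1:].replace("}", "")
            triples.append((subject, predicate, obj))
    return triples


def to_ntriples(triples):
    """Serialise IRI-only triples as N-Triples lines."""
    return [f"<{s}> <{p}> <{o}> ." for s, p, o in triples]


triples = triples_from_rdfxml(RDF_XML)
for line in to_ntriples(triples):
    print(line)
```

The resulting triple contains only three IRIs. Neither rdf:about nor rdf:resource appears in it, so counting occurrences of these attributes measures a property of one serialisation rather than of the data itself.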