Instance Level Analysis on Linked Open Data Connectivity for Cultural Heritage Entity Linking and Data Integration

Tracking #: 2707-3921

Authors: 
Go Sugimoto

Responsible editor: 
Guest Editors KG Validation and Quality

Submission type: 
Full Paper
Abstract: 
In cultural heritage, many projects execute Named Entity Linking (NEL) through global Linked Open Data (LOD) references in order to identify and disambiguate entities from their local datasets. It allows us to obtain extra information and contextualise the data with it. Thus, the aggregation and integration of heterogeneous LOD are expected. However, such development is still limited, partly due to data quality issues. In addition, analyses of LOD quality have not been sufficiently conducted for cultural heritage. Moreover, most research on data quality concentrates on ontology and corpus level observations. This paper examines the quality of the eleven major LOD sources used for NEL in cultural heritage with an emphasis on instance level connectivity and graph traversals. Standardised linking properties are inspected for 100 instances/entities in order to create “traversal maps”. Other properties are also assessed for quantity and quality. The outcomes suggest that the LOD is not fully interconnected and centrally condensed; the quantity and quality are unbalanced. Therefore, they cast doubt on the possibility to automatically identify, access, and integrate known and unknown datasets. This implies the need for LOD improvement, as well as for NEL strategies that maximise data integration.
Tags: 
Reviewed

Decision/Status: 
Major Revision

Solicited Reviews:
Review #1
By Joe Raad submitted on 04/Jun/2021
Suggestion:
Major Revision
Review Comment:

This paper presents an analysis on the quality of 11 major datasets that are related, but not restricted, to the cultural heritage domain (e.g. DBpedia, Wikidata, Wikipedia). This analysis is based mainly on the manual inspection of 100 entities from five different categories (agents, events, dates, places, and concepts), with a total of 859 IRIs being examined from these 11 datasets. The author analyses the traversal maps of these instances, and provides some statistics regarding the centrality and connectivity of these instances, when aggregated to the dataset level.

(Originality and significance of the results):
The author mentions in the introduction that the following research question is pursued in this paper: "The author examines the connectivity of the LOD instances through lookups. In other words, what level of information could we find, when a local dataset links to them, and what kind of data aggregation and integration could be possible by following the links?".

Despite the high amount of detail provided in the paper, I don't think that the paper managed to answer this research question in full, especially the part regarding "what level of information we can find when linking?". Indeed, the author studies the connectivity of the datasets by analysing the group of instances from each of the five categories, but did not manage to show the benefits of linking to any of these datasets. This can be done, for instance, by analysing the quantity and quality of (new) information that is gained for a certain dataset A when dataset A links to a dataset B. Such a study can of course take into consideration the connectivity of datasets that is studied in this work, as dataset A can also benefit from all the links originating from dataset B to other datasets, and follow these links. I believe that this type of analysis could have been the added value of this paper, as merely studying the connectivity of 11 datasets based on a sample of 100 entities has little impact: it only confirms observations made in previous (larger-scale) studies, which the author states in his conclusion. In addition, even though the author explicitly states that they want to bypass the complicated discussion regarding the semantics of the relations used for linking (with which I completely agree), one still has to make a minimum distinction in the study between the different types of relations used for linking, as there is an important difference between two IRIs linked by an rdfs:seeAlso statement and two IRIs linked by an owl:sameAs statement.
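To make the suggestion above concrete, the following is a minimal sketch of how the information gained by following links could be counted separately per link type. It is not the author's method or data: the triples are invented, the `a:`, `b:`, `c:` and `ex:` prefixes are placeholders, and the helper names `description` and `gained_properties` are hypothetical.

```python
from collections import Counter

OWL_SAMEAS = "owl:sameAs"
RDFS_SEEALSO = "rdfs:seeAlso"
LINK_PROPERTIES = {OWL_SAMEAS, RDFS_SEEALSO}

# Invented toy triples (subject, predicate, object); not taken from the paper.
TRIPLES = [
    ("a:mozart", "rdfs:label", "Mozart"),
    ("a:mozart", OWL_SAMEAS, "b:mozart"),
    ("a:mozart", RDFS_SEEALSO, "c:mozart-page"),
    ("b:mozart", "rdfs:label", "Wolfgang Amadeus Mozart"),
    ("b:mozart", "ex:birthDate", "1756-01-27"),
    ("b:mozart", "ex:birthPlace", "b:salzburg"),
    ("c:mozart-page", "ex:depicts", "c:portrait"),
]


def description(iri, triples):
    """Return the set of non-link predicates that describe `iri`."""
    return {p for s, p, _ in triples if s == iri and p not in LINK_PROPERTIES}


def gained_properties(iri, triples, follow):
    """Predicates found one hop away over `follow` links that `iri` lacks locally."""
    local = description(iri, triples)
    gained = set()
    for s, p, o in triples:
        if s == iri and p in follow:
            gained |= description(o, triples) - local
    return gained


# How many links of each type exist in the sample.
link_counts = Counter(p for _, p, _ in TRIPLES if p in LINK_PROPERTIES)

# New predicates reachable through identity links versus "see also" links.
identity_gain = gained_properties("a:mozart", TRIPLES, {OWL_SAMEAS})
related_gain = gained_properties("a:mozart", TRIPLES, {RDFS_SEEALSO})
```

Keeping the two gains apart matters because only the predicates reached over owl:sameAs can safely be merged into the description of the local entity; what is reached over rdfs:seeAlso describes a related resource.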

(Quality of writing):
Other than the limited contributions of the work, I think that the paper suffers from major presentation issues, highlighted mostly by the amount of repetitive and redundant details in Section 4 which massively impact the quality of the paper. Here are some of these presentation issues:
1) Repetitive experiments: sections 4.1 to 4.6 (which represent more than a quarter of the paper when including their figures) are basically the same analysis, just conducted on different instances
2) Redundant details: in my opinion, around half of the details in Section 4 are better fit to be included in a technical report that the author can refer to in the paper
3) Frequent presence of numbers and percentages within the text
4) Quality of the figures and the lack of self-descriptive captions
5) Difference of 5 to 6 pages between the text and the location of the figures referred to in the text.
6) Constant mix between the RDF/XML syntax and the RDF standard

I think that the problem of studying the quality of datasets that is addressed in this submission is of great importance, not only for the cultural heritage domain, but for the Web of Data in general. However, I think that the analyses conducted in this work might have limited impact, despite the important manual effort conducted by the author, which I think could be worth much more. I strongly recommend that the author aims in a re-submission at providing more insights for answering the research question that they stated in the introduction, by going beyond the connectivity and network analysis that they conducted on this sample of 100 entities. As I previously mentioned, the author can try to materialise the important manual effort done in this work, for analysing the impact and the importance of the existing links between the major datasets, in answering certain user questions in the cultural heritage domain (i.e. SPARQL queries). Such type of research is mentioned by the author as part of the challenges in the conclusion of the paper, but is not investigated in the paper. In addition, and since the author has pointed out the absence of a gold standard, I suggest that they aim to publish the 859 instances with their 10,474 links, as part of a new gold standard for evaluating the quality of datasets in the Web of Data. I suggest to make the type of links clear for the user (whether it is an owl:sameAs link or rdfs:seeAlso). It is very appreciated that the author has already made his scripts and data available on Zenodo, but I think in their current state it is slightly difficult for another user to make use of these files.

---

More specific comments can be found below.

- Section 1 provides a nice introduction for the paper, but I think it can be improved by better highlighting the research question pursued in this work and the list of contributions.

- Section 2 covers a good portion of the related work. I suggest mentioning the following paper by Guéret et al., where the authors investigated the use of a metric called "description richness" for assessing the quality of identity links. This metric measures how much is added to the description of a resource through the use of sameAs edges (see section 3.5 of that paper). This aspect is mentioned in the research question of this work, but not investigated (i.e. what level of information we can find when linking):

Christophe Guéret, Paul Groth, Claus Stadler, and Jens Lehmann. "Assessing linked data mappings using network measures." In Extended semantic web conference, pp. 87-102, 2012.

- Regarding Section 3, I don't understand why the author is making the RDF/XML syntax as a minimum requirement for choosing a dataset in the study. All the other serialisation formats (e.g. Turtle, N-triples, N-Quads) are valid RDF data as well, and most RDF libraries are able to process RDF data independently from their syntax, and easily convert from one syntax to another.

- Section 4 is very difficult to read, and repetitive.

* This section contains too many numbers, which confuses the reader since most of the time these values cannot easily be found in the Tables or Figures. I think in certain cases it would have been preferable to use an 11 x 11 matrix to show the number of connections between all the datasets, including the self-links, and only highlight the interesting numbers in the text.

* The captions of the figures are most of the time not self-descriptive. For example, the caption of Figure 17 is "The percentage of 4 properties against rdf:resource". For Figure 3, what is a node/edge? Does each node represent all the identifiers coming from a certain dataset? And if that's the case, let's suppose that only one of the 100 studied entities has its Wikidata IRI linked to its YAGO IRI, while no other links exist between Wikidata and YAGO for the remaining 99 entities: does this mean that there is a path between the Wikidata and YAGO nodes in this Figure?

* All Figures referred to from Section 4 are located at least 5 pages later in the paper. For example, in page 9 the author refers to Figure 4, that is located on page 15. The same applies for most of the figures in the paper, which makes reading this section more difficult.

* I have difficulties understanding the study conducted in section 4.8. I think there is a major confusion here between the syntax (e.g. RDF/XML that uses rdf:about and rdf:resource) and the RDF standard.

- For Section 5, I strongly suggest to have some titles for each different point of discussion addressed in the conclusion section. Long texts are hard to read and the message can get easily lost.
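To illustrate the point made above for Sections 3 and 4.8, that rdf:about and rdf:resource belong to the RDF/XML syntax rather than to the RDF data model, here is a minimal sketch that reads a small RDF/XML snippet and emits the same statements as N-Triples lines. It uses only the Python standard library and is deliberately incomplete: the `to_ntriples` helper is hypothetical, the example.org IRIs are placeholders, and only flat rdf:Description blocks with IRI or plain-literal values are handled. A real pipeline would rely on an RDF library instead.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# Placeholder RDF/XML snippet; the example.org IRIs are invented.
RDF_XML = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:owl="http://www.w3.org/2002/07/owl#"
         xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#">
  <rdf:Description rdf:about="http://example.org/a/Vienna">
    <owl:sameAs rdf:resource="http://example.org/b/Vienna"/>
    <rdfs:label>Vienna</rdfs:label>
  </rdf:Description>
</rdf:RDF>"""


def to_ntriples(rdf_xml):
    """Convert flat rdf:Description blocks into N-Triples lines.

    Not a complete RDF/XML parser: no blank nodes, datatypes,
    language tags, nested descriptions, or literal escaping.
    """
    lines = []
    root = ET.fromstring(rdf_xml)
    for node in root.findall(f"{{{RDF_NS}}}Description"):
        # rdf:about only marks the subject of the statements in this block.
        subject = node.get(f"{{{RDF_NS}}}about")
        for child in node:
            # ElementTree tags look like "{namespace}local"; joined they form the predicate IRI.
            namespace, local = child.tag[1:].split("}")
            predicate = namespace + local
            # rdf:resource only marks that the object is an IRI instead of a literal.
            target = child.get(f"{{{RDF_NS}}}resource")
            if target is not None:
                obj = f"<{target}>"
            else:
                obj = '"' + (child.text or "") + '"'
            lines.append(f"<{subject}> <{predicate}> {obj} .")
    return lines


ntriples = to_ntriples(RDF_XML)
for line in ntriples:
    print(line)
```

Neither rdf:about nor rdf:resource appears in the resulting triples, which is why counting them measures a property of the serialisation and not of the graph.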

Some minor comments and typos:
- Page 3: 11dimensions -> 11[empty space]dimensions
- Page 4: A small introduction for section 3 would be appreciated
- Page 4: s/he follows -> they follow
- I suggest to use a simpler syntax than RDF/XML for the examples (e.g. turtle or N-Triple)
- Page 4: it is highly important that the end users need to obtain -> can obtain?
- It is more common to spell out numbers smaller than 10. Example Page 5: the importance of the 4 core questions -> of the four core questions
- Page 8: It is not clear how the following statement confirms that DBpedia and YAGO are tightly connected: "Regarding the interactions of the data sources, DBpedia holds 8035 incoming and 5832 outgoing links, while YAGO does 273 and 2713 links respectively, confirming that the two data sources are tightly connected".
- Page 9: rather than their quantities of the links -> rather than the amount of links
- Page 9: [32] stress... -> Idrissou et al. [32] stress... (same applies for several similar cases)
- Page 21: rdf:reseource -> rdf:resource

Review #2
By Miel Vander Sande submitted on 12/Jul/2021
Suggestion:
Minor Revision
Review Comment:

This paper performs a detailed analysis on Linked Open Data sources that are commonly used in the cultural heritage sector for Named Entity Linking.
According to the authors, these institutions invest in recognising entities in textual descriptions and linking them to public sources (NEL) like Geonames or Wikidata to support data-driven research in the digital humanities.
Specifically, they hope to enable the exploration of global connected knowledge - as Linked Open Data promises - by contextualising their local records with these external knowledge bases and being able to traverse from one source to another. The authors argue, however, that in order to achieve this, these sources need a high interconnectivity between instances, which they study in this work as a perspective on data quality.
Rather than a statistical analysis, they perform a detailed and hand-crafted analysis of the connectivity of instances between eleven popular LOD sources and assess the extent to which one can traverse (‘explore’) the compound graph. Their main methodology consists of constructing ‘traversal maps’ of these sources: a matrix of the connections between the sources using the ‘link properties’ owl:sameAs, rdfs:seeAlso, schema:sameAs, skos:exactMatch, and analysing these in a more qualitative manner. The paper is an example of case study research, something we don’t see often in the Semantic Web/Linked Data domain.
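As a rough illustration of what such a source-by-source matrix could look like, the sketch below aggregates instance-level links into a matrix of counts, including self-links, and tallies the links per link property. The link list is invented for the example and does not reproduce any figure from the paper; the helper name `traversal_matrix` is hypothetical.

```python
from collections import Counter, defaultdict

# Invented instance-level links: (source dataset, link property, target dataset).
LINKS = [
    ("DBpedia", "owl:sameAs", "Wikidata"),
    ("DBpedia", "owl:sameAs", "YAGO"),
    ("DBpedia", "rdfs:seeAlso", "DBpedia"),
    ("Wikidata", "skos:exactMatch", "GeoNames"),
    ("YAGO", "owl:sameAs", "DBpedia"),
    ("DBpedia", "owl:sameAs", "Wikidata"),
]


def traversal_matrix(links):
    """Return (dataset names, matrix) where matrix[i][j] counts links from i to j."""
    names = sorted({s for s, _, _ in links} | {t for _, _, t in links})
    counts = defaultdict(int)
    for source, _, target in links:
        counts[(source, target)] += 1
    matrix = [[counts[(row, col)] for col in names] for row in names]
    return names, matrix


sources, matrix = traversal_matrix(LINKS)

# Links per link property, so identity links and looser links stay distinguishable.
by_property = Counter(prop for _, prop, _ in LINKS)

# Print the matrix with row labels; columns follow the same order as `sources`.
for name, row in zip(sources, matrix):
    print(f"{name:10s}", row)
```

Rows are outgoing links and columns are incoming links, so a row of zeros marks a source with no outgoing links in the sample and a column of zeros marks one that nothing links to.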

Overall, I found the paper an interesting read because of its detailed approach and narrow scope. It zooms in on a very specific case, but really studies it in depth. The methodology is carefully explained and well motivated. Attention was given to the selection of the sources and the representative entity sample in those sources (although the sample size is on the small side, it’s ok for a case study like this). There are some important issues that were identified, like the interpretations of certain predicates or the expertise of these LOD sources. That said, there are some shortcomings that need to be addressed or explained. I have serious doubts about the premise on which the authors based this research. First of all, I am not convinced you need perfect interconnectedness or should even strive for it. On several occasions, the authors suggest that high data quality corresponds with a maze network. However, that would be very hard to scale. How many sources do you link to and for how long? The fact that sources can evolve independently and thus in some cases have no incoming or outgoing links is essential for linked data to scale at all. A situation with link hubs such as sameAs.org closing those gaps, which was unfortunately not included in the analysis, is far more realistic. To that extent, I find that the paper overstates the importance of connectedness between LOD sources for cultural heritage institutions doing NEL. Sure, it’s very important and a higher quality would help, but there are other uses of linked entities to support research that are more valuable, like data enrichment or a (federated) querying approach. Also, it assumes a very naive way of doing NEL that only links to a single source, while most frameworks allow you to connect your entity to multiple sources and therefore connect the sources themselves. I suggest the authors reexamine the positioning of the impact of their work.

Another issue is that some sections/paragraphs in the paper are written in an overly complex manner and are almost incomprehensible. I suggest to simplify the language overall and try to minimise the verbosity. Examples:
-Although the quantity problem needs to be addressed eventually, it is rather an economic and political issue; therefore, this paper investigates the data quality issues from a technical point of view that are discussed to a lesser extent in the cultural heritage community.
-These terms can generally be grouped together as a task to identify, disambiguate, and extract entities from data and to reconcile and make references to the same or matching entities in another data.
- NEL fosters contextualisation of information to facilitate data integration.
- Apart from Mountantonakis and Tzitzikas, macro research projects for the linking quality oftentimes investigate data sources (or corpora) as a whole especially in the sense of owl:sameAs. The data connectivity is examined regardless of the user mobility at an instance level.
- …

I was also wondering why the authors only focus on instance similarity and did not take into account other predicates that are used with objects from external sources? Especially because the most interesting semantics are captured in that way.

Some more detailed comments per section:

Introduction
- the intro feels a bit bloated and should be written in a more concise, to-the-point manner.
- when mentioning that computer science research has been working on data quality issues, the authors should mention what issues they are referring to.
- the concept of ‘following the links’ can be interpreted in many ways; it should be more clearly defined in the text

Related work
- it’s not clear what the authors consider graph traversal or graph traversal research

Methodology
- I suggest adding a short description of a traversal map when the term is mentioned for the first time
- using ‘the author’ to refer to the authors is strange; maybe replace it with a simple ‘we’?
- although I get the argument for using RDF/XML, I would still use turtle syntax for the sake of readability
- Is there a reference for RDF Beams?
- can the authors add a reference for the claim ‘the entities are confusingly organised and hidden from the mainstream contents, especially in aggregated LOD’?
- what does ‘best-effort’ mean in the context of lookups?

Linked Open Data Analysis
- in 4.7, can’t the undirected weighted graph with link strength be simulated in order to use [32]
- the last paragraph on page 22 is a good example of subjective language that should not be part of the analysis section. It’s better to move such interpretations to the conclusion.

Conclusion
- from a use perspective, is it the lack of connectedness or the lack of completeness that causes the data quality issue? I’d argue the latter.

for this scalability to work.

Review #3
By Herminio Garcia-Gonzalez submitted on 22/Sep/2021
Suggestion:
Major Revision
Review Comment:

This paper describes a study on graph traversals in 11 major LOD providers and how they could affect data quality and integration in the field of cultural heritage. The author argues that many cultural heritage projects are running NEL to link to these major LOD resources and that, therefore, it is useful to know how well they are connected to each other, which would result in an integrated, high-quality source of information.

I found the results of this study very interesting as it puts numbers and examples to a well-known problem in the Semantic Web, namely, the interlinking between different projects. In this regard, these results are worth publishing alongside the conclusions derived from them. However, I see some drawbacks that should be fixed before publication, specifically in the introduction and motivation of the experiment.

Firstly, the introduction does not justify very well the use of NEL and cultural heritage as a motivation for graph traversals. It reads more like "I did this experiment and I justify it by using some striking terms", and it should definitely be the other way round. In addition, one could argue that NEL is important in Cultural Heritage but also in other fields and, in the same way, graph traversals are not only important in CH but also in the whole LOD ecosystem, as the author himself says in the last paragraph of the paper. Therefore, I encourage the author to better support the use of these terms and fields in the introduction and motivation section.

About data integration, the author seems to reduce all the data integration metrics to the linking between different datasets. However, I see data integration as a much broader term. A number of datasets could be well integrated if they came from heterogeneous formats and ended up in the same format, so that the user can query them through a single interface or format. I would rather describe the author's use of the data integration term as data linking (or record linkage). If the author wants to use the term data integration like this, a proper citable definition should be clearly provided.

Likewise, data quality is not measured only by data linkage. Data quality depends on more attributes, like having all instances of a type follow the same schema (using the same types for the same attributes, using the attribute semantics in the same way, etc.). Thus, I see the same problem with the use of this term in the paper as with data integration.

In Section 3.3 RDF Beams is mentioned but there is no link to the source code. In order to ensure the reproducibility of the experiment the author should provide a link to the code.

There are some figures that are never mentioned in the text. They should be all linked in the text.

Graphics like the one in Fig. 9 do not seem useful. I think that a simple table would be much more understandable than this graphic. Therefore, it should be changed to a table or to a more informative graphic.

I think that endnotes section with a listing containing a SPARQL query to obtain the Top 20 list should be placed closer to where it is mentioned.

Some typos per section:
# Introduction
of the inventor of the Web -> of the Web inventor
computer science communities working in this issue in recent years -> computer science communities working in this issue in the recent years (also it would be good to include a cite for this claim.)
The following section describes -> Section 3 describes

# Methodology
none of previous studies -> none of the previous studies
Wikidata and DBpedia are derived from Wikipedia (I thought that DBpedia was derived but Wikidata not...)

# Linked Open Data Analysis
Jesus are the lowest number of links -> Jesus has the lowest number of links
emerged as the second with 83 link -> emerged as the second with 83 links
the lowest entities are surprisingly -> the lowest entities are surprisingly:
the total number of rdf:reseource -> the total number of rdf:resource

# Conclusion
that next generation search engines -> the next generation of search engines