Generation of Training Data for Named Entity Recognition of Artworks

Tracking #: 2766-3980

Authors: 
Nitisha Jain
Alejandro Sierra
Jan Ehmueller
Ralf Krestel

Responsible editor: 
Special Issue Cultural Heritage 2021

Submission type: 
Full Paper

Abstract: 
As machine learning techniques are being increasingly employed for text processing tasks, the need for training data has become a major bottleneck for their application. Manual generation of large-scale training datasets tailored to each task is a time-consuming and expensive process, which necessitates their automated generation. In this work, we turn our attention towards the creation of training datasets for named entity recognition (NER) in the context of the cultural heritage domain. NER plays an important role in many natural language processing systems. Most NER systems are typically limited to a few common named entity types, such as person, location, and organization. However, for cultural heritage resources, such as digitized art archives, the recognition of fine-grained entity types such as titles of artworks is of high importance. Current state-of-the-art tools are unable to adequately identify artwork titles due to the unavailability of relevant training datasets. We analyse the particular difficulties presented by this domain and motivate the need for quality annotations to train machine learning models for the identification of artwork titles. We present a framework with a heuristic-based approach to create high-quality training data by leveraging existing cultural heritage resources from knowledge bases such as Wikidata. Experimental evaluation shows a significant improvement over the baseline in NER performance for artwork titles when models are trained on the dataset generated using our framework.
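
As an illustration of how candidate artwork titles can be drawn from Wikidata, the following Python sketch retrieves English labels of paintings from the public Wikidata SPARQL endpoint. It is not the authors' actual query pipeline: Q3305213 (painting) and P31 (instance of) are standard Wikidata identifiers, but the query, the User-Agent string, and the result handling are illustrative assumptions only.

    import requests

    # Minimal sketch: fetch English labels of paintings from Wikidata.
    # Q3305213 = painting, P31 = instance of; LIMIT keeps the example small.
    SPARQL = """
    SELECT ?item ?itemLabel WHERE {
      ?item wdt:P31 wd:Q3305213 .
      SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
    }
    LIMIT 100
    """

    response = requests.get(
        "https://query.wikidata.org/sparql",
        params={"query": SPARQL, "format": "json"},
        headers={"User-Agent": "artwork-ner-example/0.1 (illustrative sketch)"},
    )
    response.raise_for_status()

    titles = [
        binding["itemLabel"]["value"]
        for binding in response.json()["results"]["bindings"]
    ]
    print(titles[:10])
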
Tags: 
Reviewed

Decision/Status: 
Minor Revision

Solicited Reviews:
Review #1
By Marijn Koolen submitted on 07/May/2021
Suggestion:
Minor Revision
Review Comment:

This paper presents an approach to building a large-scale annotated dataset for training NER taggers on the domain of cultural heritage (CH) objects and specifically on recognising titles of artworks.

This is a well-written and clearly structured paper that is easy to follow. The authors identify cultural heritage objects as a broad category that has received little attention in fine-grained NER research and for which no or few NER training datasets exist. Moreover, they provide a realistic test case of a heterogeneous set of OCR'ed documents that contain text-recognition errors and from which most structural/layout information has been removed. This is a common situation that provides additional challenges which cannot easily be brushed aside by focusing on a small set of highly curated and correct documents.

Beyond the challenges of working with digitised CH texts, the authors do a good job of highlighting and detailing the specific challenges and kinds of errors made in recognising names in the domain of CH objects.

The paper is well-embedded in a large body of recent and older literature. I particularly like that the authors discuss explicitly how this work extends one of their previous contributions. This is really helpful to readers who are familiar with the previous work.

The evaluation shows how the three stages influence the overall performance and the authors included an analysis of how the size of the dataset affects performance, which I wish more authors would do, as it gives an indication of the cost-benefit trade-off of putting in additional annotation effort. To get an even better impression of this, it would be useful if the authors included even smaller fractions (e.g. 5, 10, 15, 20%), as that would probably make the overall shape of the curve clearer. I wouldn't call this a required revision, but definitely a way to make the contribution even stronger.

The main issue I have with this paper is how the authors define the scope of their problem. The choice to restrict the category of artworks to only paintings and sculptures prompts several questions. Why focus on only those types of artworks? And is it preferable to use this broad label for a narrow interpretation of the category of artwork that includes only these few types, or to use a different category label for this narrower subset (e.g. would 'visual artwork' be a more appropriate label)? To what extent does the answer to the previous question depend on the domain and types of texts that are to be NER'ed?

The public datasets from Wikidata and Getty are probably biased towards more popular CH objects, which likely skews the NER tagger towards the titles of these more popular objects, leaving it missing or underperforming on titles in the long tail of less well-known objects. This is not a criticism of the paper, just a question out of interest (I don't expect the authors to tackle all the problems of NER in the CH domain in one go). Is there any reason to believe the annotated set of titles is or is not representative of artwork titles in general? I would like the authors to add some reflection on whether they think this bias is present and causes a problem for recognising artworks in general, and if so, what the consequences of this bias are and whether there are possible ways to deal with it.

Overall, this paper makes a clear contribution towards domain-specific fine-grained NER, with consequences for any downstream tasks that rely on it, but which can be improved with some minor revisions.

One comment on the fact that this paper is submitted to a special issue of SWJ. Although the authors use Wikidata and Getty datasets as sources of titles, the connection with the Semantic Web in this paper is not so clear, as the focus is on NLP/NER problems and solutions.

Finally, the paper makes no mention of whether the annotated dataset and the trained models are or will be made available.

Strong points:
- the focus on the CH domain for NER addresses domain-specific problems and categories and results in a training dataset that is a valuable resource for further research and for use in enriching and linking CH collections.
- the authors discuss the specific challenges in a clear and structured way.

Points for improvement:
- the exclusion of many types of artworks needs to be discussed, as well as the consequences of using a broad category label for a narrow definition.
- the criteria used in filtering the list of titles to include in the annotated dataset in the first of the three stages need to be argued for and their consequences discussed (preferably with statistics on how each step reduces the size of the dataset and how it qualitatively changes the nature of the dataset).

Specific comments:

- p. 4, col. 2, lines 45-46: "to generate high-quality training corpus" -> "to generate a high-quality training corpus"
- p. 6, col. 1, lines 43-44: "that can be found in dictionary" -> "that can be found in a dictionary"
- p. 7, col. 1, lines 37-38: "refine annotations for artwork named entity." -> "refine annotations for artwork named entities."
- p. 7, col. 2, lines 26-29: The authors remove artwork names consisting of a single word, as many of these are highly generic words. This is a very significant step, but the consequences are not discussed in much detail. I can imagine that a significant number of one-word titles are highly uncommon or at least unique to the artwork. The authors should discuss why the problem with the generality of words is tackled by focusing only on the length of the title (a single word) and not on the commonness of the single title word (a possible commonness-based filter is sketched after this list of comments). Moreover, it would be useful to provide statistics on how many/what fraction of artworks are removed in each filtering step, to make the consequences clearer.
- p. 8, col. 1, line 23: "by expert user community" -> "by an/the expert user community" or "by expert user communities"
- p. 8, col. 2, line 45: "in entity dictionary" -> "in the entity dictionary"
- p. 8, col. 2, lines 45-46: "was maintained as spans " -> "were maintained as spans"
- p. 9, col. 1, lines 23-24: "with the help of set of labelling functions and patterns" -> "a set of" or "sets of"
- p. 10, col. 1, lines 26-27: "referring an artwork" -> "referring to an artwork"
- p. 10, col. 2, lines 26-27: "After filtering out English texts and performing initial" -> This sounds as though the English texts are removed from the dataset, although the authors earlier on indicated that they focus on English. If that is the case, I think it would be clearer to say "After removing all non-English texts ..."
- p. 10, col. 2, lines 44:49: It seems as though the authors are focusing on paintings and sculptures, yet refer to them with the broader title of "artwork" and comment on the OntoNotes5 category of "work_of_art" as including many other things beyond paintings and sculptures. I think this needs to be discussed more clearly in the introduction or section 3, e.g. what types of entities do the authors include and exclude from the category of artwork in this task, and why. Wikipedia and several dictionaries consider "artwork" and "work of art" to be synonyms and to include all these types of objects. Novels, films, musical pieces and video games are also artworks and cultural heritage objects, so it seems the chosen focus on paintings and sculptures requires some discussion as to whether, and if so, how, they are different from other types of artworks. Example 6 in Table 5, where the name of a novel is tagged as an artwork, would be interesting to incorporate in that discussion. To compare the challenges and issues with e.g. identifying book titles, see refs [1-3] below (full disclosure, I'm co-author on [1]).
- p. 11, col. 1, lines 28-29: "on Ontonotes5 dataset" -> "on the Ontonotes5 dataset"
- p. 11, col. 1, lines 46-47: "an NER framework in form of" -> "a NER framework in the form of"
- p. 12, col. 2, lines 32-33: "in semi-automated manner" -> "in a semi-automated manner"
- p. 13, col. 2, line 50: "of annotation dataset" -> "of the annotation dataset"
- p. 14, col. 2, line 28: "including artwork" -> the sentence structure suggests this should be " including artworks"
- p. 14, col. 2, line 29: "a few examples texts" -> "a few example texts"
- p. 14, col. 2, line 38: "texts that needs" -> "texts that need"
- p. 15, col. 1, line 23: "in semi-automated manner" -> "in a semi-automated manner"
- p. 15, col. 1, line 46: "on existing knowledge graph" -> "on existing knowledge graphs"
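
As referenced in the comment on p. 7, col. 2 above, a commonness-based filter for single-word titles could look like the following Python sketch. The keep_title helper and the common_words set are hypothetical names used only for illustration and are not part of the authors' pipeline; common_words stands in for, e.g., the top 10,000 entries of an English word-frequency list.

    # Hypothetical sketch: filter single-word titles by commonness rather than
    # dropping all of them. Multi-word titles are kept unchanged.
    def keep_title(title: str, common_words: set) -> bool:
        tokens = title.split()
        if len(tokens) > 1:
            return True  # multi-word titles are kept as before
        # Keep a one-word title only if it is not a common dictionary word,
        # so rare or unique single-word titles survive the filter.
        return tokens[0].lower() not in common_words

    titles = ["Guernica", "Spring", "The Starry Night"]
    common_words = {"spring", "the", "night"}
    kept = [t for t in titles if keep_title(t, common_words)]
    # kept == ["Guernica", "The Starry Night"]
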

References:

[1] Bogers, T., Hendrickx, I., Koolen, M., & Verberne, S. (2016, September). Overview of the SBS 2016 mining track. In Conference and Labs of the Evaluation forum (pp. 1053-1063). CEUR Workshop Proceedings.

[2] Ollagnier, A., Fournier, S., & Bellot, P. (2016). Linking Task: Identifying Authors and Book Titles in Verbose Queries. In CLEF (Working Notes) (pp. 1064-1071).

[3] Ziak, H., & Kern, R. (2016). KNOW At The Social Book Search Lab 2016 Suggestion Track. In CLEF (Working Notes) (pp. 1183-1189).

Review #2
Anonymous submitted on 19/May/2021
Suggestion:
Accept
Review Comment:

(1) Quality, importance, and impact

The motivation behind the need for training data for named entity recognition of artworks, and its relevance, are clearly described in the paper, and the limits of current NER systems are well explained.

The results of the retrained systems show that the work done in this paper is effective and that the approach used to generate training data can potentially improve the state of the art (and potentially be extended to other domains).

Given that the focus of the paper is on the methodology used to create training data, it would be interesting to add some information that quantifies the quality of the results obtained from the three-stage framework, such as the inter-annotator agreement between the data extracted by the framework and the human annotations (i.e. on the 544 entries of the test set that are already manually annotated).
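
One way to quantify this, assuming token-level BIO tags are available for both the framework output and the manual annotations of the 544 test entries, is Cohen's kappa via scikit-learn; the tag sequences below are hypothetical placeholders, not data from the paper.

    from sklearn.metrics import cohen_kappa_score

    # Hypothetical token-level BIO tags over the same tokens: one sequence
    # produced by the framework, one by the human annotators.
    framework_tags = ["O", "B-ARTWORK", "I-ARTWORK", "O", "O", "B-ARTWORK"]
    human_tags     = ["O", "B-ARTWORK", "I-ARTWORK", "O", "B-ARTWORK", "O"]

    kappa = cohen_kappa_score(framework_tags, human_tags)
    print(f"Token-level Cohen's kappa: {kappa:.2f}")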

I don't know if this was intentional, but there is no link to the demo. It would be useful to have it in the paper.

(2) Clarity, illustration, and readability

The paper is well-structured, the language is good and the discourse is easy to follow.
The figures are clear and useful.
I really appreciated that every case/error presented in the paper was followed by an example.

Just a few things to check:

Page 1 line 44: are restricted to only on a few -> are restricted to only a few
Page 4 line 46: a framework to generate high-quality training corpus in a scalable.. -> do you mean corpora?
Page 5 Table 1: the sum of the ratios is not 1
Page 9 line 23: with the help of set of labelling functions -> of a set of
Page 12 line 32: approach to generate such annotations in semi-automated manner -> in a semi-automated manner
Page 13 line 41: from the Text 2 the title... -> from Text 2
Page 14 line 36: such as in the Text 6 -> such as in Text 6

Review #3
By Marieke van Erp submitted on 24/Oct/2021
Suggestion:
Major Revision
Review Comment:

This manuscript was submitted as 'full paper' and should be reviewed along the usual dimensions for research contributions which include (1) originality, (2) significance of the results, and (3) quality of writing. Please also assess the data file provided by the authors under “Long-term stable URL for resources”. In particular, assess (A) whether the data file is well organized and in particular contains a README file which makes it easy for you to assess the data, (B) whether the provided resources appear to be complete for replication of experiments, and if not, why, (C) whether the chosen repository, if it is not GitHub, Figshare or Zenodo, is appropriate for long-term repository discoverability, and (D) whether the provided data artifacts are complete. Please refer to the reviewer instructions and the FAQ for further information.

The paper presents an approach to, and experiments on, the creation of training data for named entity recognition in the cultural heritage domain, specifically for recognising titles of artworks. Overall the paper is well-written and provides a useful use case for the domain. The paper could be improved by explaining some of the experimental design choices and analyses, and by adding some ideas on how this approach can be generalised to other entity types in the cultural heritage domain.

Some questions I had when reading it which I think should be clarified:
- Nested entities are mentioned as a problem in Section 3, but it is not clear how they are treated; it would be good to explain this. In Subsection 3.4 it is not clear who the annotator is in 'is not tagged as any named entity at all'.
- The reduction in the number of sentences and annotations caused by Snorkel should be explained better for readers not familiar with Snorkel (a minimal sketch of how labelling-function abstentions cause this reduction follows this list).
- Section 5: was a comparison with a fine-grained NER system such as [13] or [14] not an option? This would be more realistic than using a regular NER system.
- The IAA is quite low; was there a training and/or adjudication phase in the annotation process?
- Section 6: When the system had difficulty correctly detecting boundaries and artwork types, were these also the cases that the annotators didn't agree on/had difficulty with?
- Section 7: What is meant by 'a user-friendly interface'? This really depends on the intended users, and it does not seem that the interface has been evaluated.
- Some relevant papers:
Freire, Nuno, José Borbinha, and Pável Calado. "An approach for named entity recognition in poorly structured data." In Extended Semantic Web Conference, pp. 718-732. Springer, Berlin, Heidelberg, 2012.
Meroño-Peñuela, Albert, Ashkan Ashkpour, Marieke Van Erp, Kees Mandemakers, Leen Breure, Andrea Scharnhorst, Stefan Schlobach, and Frank Van Harmelen. "Semantic technologies for historical research: A survey." Semantic Web 6, no. 6 (2015): 539-564.
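
On the Snorkel point raised above: a labelling function either votes for a label or abstains (returns -1), and sentences on which every function abstains receive no weak label and are dropped, which is where the reduction in sentences and annotations comes from. The sketch below is a minimal, hypothetical illustration; the title dictionary, data frame columns and labelling function are assumptions, not the authors' actual functions.

    import pandas as pd
    from snorkel.labeling import labeling_function, PandasLFApplier

    ARTWORK, ABSTAIN = 1, -1
    TITLE_DICTIONARY = {"Mona Lisa", "The Night Watch"}  # illustrative titles

    @labeling_function()
    def lf_title_in_dictionary(x):
        # Vote ARTWORK if a known title occurs in the sentence, otherwise abstain.
        return ARTWORK if any(t in x.sentence for t in TITLE_DICTIONARY) else ABSTAIN

    df_train = pd.DataFrame({"sentence": [
        "The Mona Lisa hangs in the Louvre.",
        "The museum opened a new wing in 1999.",
    ]})

    applier = PandasLFApplier(lfs=[lf_title_in_dictionary])
    L_train = applier.apply(df=df_train)  # label matrix, -1 = abstain

    # Sentences where every labelling function abstained get no weak label;
    # dropping them is what shrinks the training set.
    has_label = (L_train != ABSTAIN).any(axis=1)
    kept = df_train[has_label]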

Text comments:
Section 1
"in recent years" -> add citation
[12], etc. -> remove 'etc'
Section 3:
What do Count and Ratio refer to in Table 1?
WPI)^4, -> WPI),^4
In order to -> To
"we encountered many additional errors" -> such as and how were these dealt with? what is their impact?
Section 4
for artwork -> for an artwork
we focus on English texts -> mention that English is the working language earlier (for example the introduction)
1075 -> 1,075
languages were obtained -> languages was obtained
by expert user community -> by an expert user community
Table 2 is better right aligned
Section 5
Spacy -> spaCy
For generating a test dataset -> To generate a test dataset
task^13, -> task,^13
References:
Check the references again; for example, in [6] and [18] the first author's first name is "Erik" and his last name is "Tjong Kim Sang", which LaTeX needs some help with.