Generation of Training Data for Named Entity Recognition of Artworks

Tracking #: 2966-4180

Nitisha Jain
Alejandro Sierra
Jan Ehmueller
Ralf Krestel

Responsible editor: Special Issue Cultural Heritage 2021

Submission type: Full Paper

As machine learning techniques are being increasingly employed for text processing tasks, the need for training data has become a major bottleneck for their application. Manual generation of large-scale training datasets tailored to each task is a time-consuming and expensive process, which necessitates their automated generation. In this work, we turn our attention towards the creation of training datasets for named entity recognition (NER) in the context of the cultural heritage domain. NER plays an important role in many natural language processing systems. Most NER systems are typically limited to a few common named entity types, such as person, location, and organization. However, for cultural heritage resources, such as digitized art archives, the recognition of fine-grained entity types such as titles of artworks is of high importance. Current state-of-the-art tools are unable to adequately identify artwork titles due to the unavailability of relevant training datasets. We analyse the particular difficulties presented by this domain and motivate the need for quality annotations to train machine learning models for the identification of artwork titles. We present a framework with a heuristic-based approach to create high-quality training data by leveraging existing cultural heritage resources from knowledge bases such as Wikidata. Experimental evaluation shows significant improvement over the baseline for NER performance for artwork titles when models are trained on the dataset generated using our framework.

Minor Revision

Solicited Reviews:
Review #1
By Marijn Koolen submitted on 10/Jan/2022
Minor Revision
Review Comment:

This revised version of the paper improves on many of the issues raised previously. The authors added more detail and nuance about the experimental setup, the artwork category and its connection to other categories, and the potential popularity bias. The authors now also include a link to the code and trained models.

As I mentioned in my previous review, this paper makes a good contribution by discussing the value and challenges of NER for artworks and providing datasets and models. The one remaining point for improvement is the limited discussion of the choice to exclude one-word titles: what kind of bias this creates and how it potentially affects users of the output.

Specific comments:

P. 8: In my review of the previous version of this paper, I asked the authors to discuss the consequences of removing one-word titles from the annotation data. In this revision, the authors mention that only 5% of titles are affected by this step. I appreciate their elaboration, but although that is a low percentage, they still don't discuss the implications of this exclusion, which creates systematic biases in the ground truth dataset. Thinking about using the output of a NER tagger trained on this data, I wonder what users' response would be if they were told that the available titles exclude all artworks with a one-word title.

P. 12, footnote 16: It would be good to include the SpaCy and Flair version numbers as well as the names and versions of the specific trained models that were used, as improved versions--especially of pre-trained models--are released regularly. The requirements.txt in the GitHub repository only has the SpaCy and Flair version numbers, not the models that were re-trained. If it takes up too much space in the paper, refer to the repo README and add the details there.
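
As an illustration of how the requested version details could be collected for the README, here is a minimal sketch, assuming a Python environment in which the libraries are installed with pip. The helper name `report_versions` and the package list are hypothetical and not taken from the authors' repository; `en_core_web_sm` is only an example of a pip-installed SpaCy pipeline, not necessarily the model used in the paper. Models that are downloaded at runtime rather than installed as packages would still need to be recorded by name manually.

```python
from importlib.metadata import PackageNotFoundError, version


def report_versions(packages):
    """Return a {package: version} mapping; missing packages map to 'not installed'."""
    report = {}
    for name in packages:
        try:
            report[name] = version(name)
        except PackageNotFoundError:
            report[name] = "not installed"
    return report


if __name__ == "__main__":
    # Hypothetical package list, chosen for illustration only.
    for name, ver in report_versions(["spacy", "flair", "en_core_web_sm"]).items():
        print(f"{name}=={ver}")
```

The printed lines can be pasted into the README next to the names of the re-trained models, so that readers can reproduce the exact setup even after newer releases appear.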

P. 12, col. 1: "was configures" -> "was configured"

P. 14, sec. 6.1: The addition of the smaller sample sizes is very useful, as it shows that the curves are rapidly stabilising (as is to be expected), although the SpaCy model, under relaxed conditions, seems to keep improving more strongly, particularly on precision. This suggests there is value in increasing the training set size, but also that, at the size of the current dataset, the model is mainly getting better at roughly spotting where titles are mentioned, and much less at identifying the exact titles.

Review #2
Anonymous submitted on 11/May/2022
Minor Revision
Review Comment:

The authors present a framework that uses a heuristic-based approach to generate high-quality training data with the help of existing cultural heritage resources from knowledge bases such as Wikidata.

The work is interesting and much improved compared to the previous version.

I have a few comments for the authors:

- In section 2.2, the authors can also refer to and compare to [1].

- In section 3 the challenges are very well motivated and explained.

- Can Figure 3 be revised so that it depicts the method more clearly? I understand the figure better by looking only at the steps on the right-hand side than by looking at the image together with its caption. The caption could also be made more descriptive.

Detailed comments:

"This work was first introduced in 2019 as a short paper at the 23rd International Conference on Theory and Practice of Digital Libraries [34]" --> Rephrase to something like "This work is an extended version of [34]"