Ontology-based Information Extraction from Cultural Heritage Digital Representations: A Case Study in Portuguese Archives

Tracking #: 4041-5255

Authors: 
Mariana Dias
Carla Teixeira Lopes

Responsible editor: 
Guest Editors 2025 OD+CH

Submission type: 
Tool/System Report
Abstract: 
Linked Data enables cultural heritage institutions to refine archival descriptions and improve access, but manually creating such descriptions remains labor-intensive. Automating information extraction from digitized records to generate Linked Data descriptions addresses this problem. However, complex or resource-intensive information extraction pipelines are often impractical for resource-constrained archival institutions. This paper identifies and evaluates the most effective methods for ontology-aligned information extraction from Portuguese archival collections under low-resource conditions. We compare early sequence labeling architectures (fine-tuned) with transformer-based zero-shot models for entity and relation extraction aligned with ArchOnto, an ontology designed for the archival domain and based on CIDOC-CRM, an ontology for the cultural heritage domain. Models were evaluated on general-domain Portuguese datasets and domain-specific 20th-century Portuguese archival texts consisting of Optical Character Recognition (OCR)-extracted text and corresponding human-made transcriptions. Our results highlight the challenges of extracting information from noisy archival texts, where BiLSTM-CRF-based named entity recognition models achieved solid performance, while GLiREL produced poor relation extraction results, limiting reliable ontology-guided triple extraction.
Tags: 
Reviewed

Decision/Status: 
Accept

Solicited Reviews:
Review #1
Anonymous submitted on 19/Apr/2026
Suggestion:
Accept
Review Comment:

This manuscript was submitted as 'Tools and Systems Report'. The authors followed and addressed all of the reviewers’ comments point by point and submitted a much more detailed and comprehensive draft of the article. In particular,
(1) Quality, importance, and impact of the described tool or system
The manuscript effectively addresses a pressing need in the field of cultural heritage, by combining ontology and automated information extraction for archival collections in non-English languages, where these tasks are less well-established. It reuses ArchOnto, an ontology designed for the archival domain and based on CIDOC-CRM, which is the reference ontology for the cultural heritage domain.
(2) Clarity, illustration, and readability of the describing paper
To improve clarity, the authors separated the “Background” and “Related Work” sections, as well as the “Dataset Generation” and “Results” sections, and added sections on “Methodology” and “Mapping Extracted Information to Linked Data”. Furthermore, the authors clarified in the Discussion how the proposed model addresses the research questions explicitly stated in the Introduction. They also provided more than twice as many tables and figures with additional data to support their arguments. The bibliography has been expanded to reflect the state of the art and related works.
In conclusion, the proposal appears convincing and ready for publication.

Review #2
Anonymous submitted on 30/Apr/2026
Suggestion:
Major Revision
Review Comment:

First of all, the authors have made substantial efforts to address the previous comments. The revised manuscript is significantly improved in terms of structure, clarity, and completeness.

However, the justification for excluding LLM-based approaches remains a concern: the rationale provided is not entirely convincing. Given the strong performance of LLMs in recent information extraction tasks, a more comprehensive discussion in the manuscript and an experimental comparison are expected (even on a small part of the datasets). At the same time, some RE results (<10%) are too weak to support strong conclusions about end-to-end pipeline performance (e.g., versus an LLM approach).

Therefore, this issue should be further clarified and strengthened before the manuscript can be considered for publication. For this reason, I recommend Major Revision.

Review #3
Anonymous submitted on 27/May/2026
Suggestion:
Minor Revision
Review Comment:

This paper addresses a relevant problem in the CH domain, namely the automatic extraction and creation of linked data descriptions under low-resource conditions. The authors evaluate information extraction (NER and RE) models on non-English retrodigitized documents with OCR noise, aligning extracted information with ArchOnto. In this revised version, fine-tuned BiLSTM-CRF models are compared against zero-shot transformer-based models in different settings.

Strengths
The problem addressed is particularly relevant for GLAM institutions, which usually operate with limited computational resources and challenging documents (non-annotated, non-English, noisy).
The paper is well-structured and the pipeline is clearly described. Compared to the previous version, this revision shows improvements: results are reported more extensively, the dataset creation process is transparent, and models of different architectures and sizes are evaluated.

Weaknesses
The main limitation of this paper (acknowledged by the authors) is the limited size and composition of the domain-specific evaluation data. The Ner.spec.OCR and Ner.spec.Human datasets comprise only 13 documents and a limited set of entity types and relations, which makes the results hard to generalize. Additionally, the RE task lacks a baseline, since only the GLiREL model is evaluated due to insufficient training data. As such, the RE-related aspects of RQ1, RQ4 and RQ5 are not fully explored and the RQs might need to be reshaped. Finally, I would like to underscore that the choice of GLiNER’s entity labels (e.g., replacing Group with Organization) may have influenced the model’s predictions and results.

Minor concerns
- The list of contributions in the Introduction skips contribution 2).
- The future work section would benefit from greater specificity.
- Typos in Sec 6.2.1 and 6.2.2: words are missing a final 's' in a few places (e.g. 'we evaluated the model…', 'of extracting relation between…').