Editorial Board

Editor-in-Chief
Cogan Shimizu
Eva Blomqvist

Editorial Board
Mehwish Alam
Claudia d’Amato
Stefano Borgo
Boyan Brodaric
Philipp Cimiano
Michael Cochez
Oscar Corcho
Bernardo Cuenca-Grau
Elena Demidova
Jerome Euzenat
Sebastián Ferrada
Mark Gahegan
Aldo Gangemi
Dagmar Gromann
Armin Haller
Pascal Hitzler
Aidan Hogan
Katja Hose
Eero Hyvönen
Krzysztof Janowicz
Sabrina Kirrane
Agnieszka Lawrynowicz
Freddy Lecue
Maria Maleshkova
Raghava Mutharaju
Axel Polleres
Guilin Qi
Marta Sabou
Harald Sack
Angelo Salatino
Christoph Schlieder
Stefan Schlobach
Cogan Shimizu
Blerina Spahiu
Sanju Tiwari
GQ Zhang
Rui Zhu

Former/Founding Editors-in-Chief
Krzysztof Janowicz
Pascal Hitzler

Editorial Assistants
Michael McCain

Ontology-based Information Extraction from Cultural Heritage Digital Representations: A Case Study in Portuguese Archives

Submitted by Mariana Dias on 03/07/2026 - 06:04

Tracking #: 4041-5255

Authors:

Mariana Dias

Carla Teixeira Lopes

Responsible editor:

Guest Editors 2025 OD+CH

Submission type:

Tool/System Report

Abstract:

Linked Data enables cultural heritage institutions to refine archival descriptions and improve access, but manually creating such descriptions remains labor-intensive. Automating information extraction from digitized records to generate Linked Data descriptions addresses this problem. However, complex or resource-intensive information extraction pipelines are often impractical for resource-constrained archival institutions. This paper identifies and evaluates the most effective methods for ontology-aligned information extraction from Portuguese archival collections under low-resource conditions. We compare early sequence labeling architectures (fine-tuned) with transformer-based zero-shot models for entity and relation extraction aligned with ArchOnto, an ontology designed for the archival domain and based on CIDOC-CRM, an ontology for the cultural heritage domain. Models were evaluated on general-domain Portuguese datasets and domain-specific 20th-century Portuguese archival texts consisting of Optical Character Recognition (OCR)-extracted text and corresponding human-made transcriptions. Our results highlight the challenges of extracting information from noisy archival texts, where BiLSTM-CRF-based named entity recognition models achieved solid performance, while GLiREL produced poor relation extraction results, limiting reliable ontology-guided triple extraction.

Full PDF Version:

swj4041.pdf

Previous Version:

Ontology-based Information Extraction from Cultural Heritage Digital Representations: A Case Study in Portuguese Archives

Tags:

Reviewed

Long-term Stable Link to Resources:

https://github.com/MarianaFerrDias/ArchExtract

Decision/Status:

Solicited Reviews:

Review #1

Anonymous submitted on 19/Apr/2026

Suggestion:
Accept

Review Comment:

This manuscript was submitted as 'Tools and Systems Report'. The authors followed and addressed all of the reviewers’ comments point by point and submitted a much more detailed and comprehensive draft of the article. In particular,
(1) Quality, importance, and impact of the described tool or system
The manuscript effectively addresses a pressing need in the field of cultural heritage by combining an ontology with automated information extraction for archival collections in non-English languages, where these tasks are less well-established. It reuses ArchOnto, an ontology designed for the archival domain and based on CIDOC-CRM, which is the reference ontology for the cultural heritage domain.
(2) Clarity, illustration, and readability of the describing paper
To improve clarity, the authors separated the “Background” and “Related Work” sections, as well as the “Dataset Generation” and “Results” sections, and added sections on “Methodology” and “Mapping Extracted Information to Linked Data”. Furthermore, the authors clarified in the Discussion how the proposed model addresses the research questions explicitly stated in the Introduction. They also provided more than twice as many tables and figures with additional data to support their arguments. The bibliography has been expanded to reflect the state of the art and related works.
In conclusion, the proposal appears convincing and ready for publication.

Review #2

Anonymous submitted on 30/Apr/2026

Suggestion:
Major Revision

Review Comment:

First of all, the authors have made substantial efforts to address the previous comments. The revised manuscript is significantly improved in terms of structure, clarity, and completeness.

However, the justification provided for excluding LLM-based approaches is not entirely convincing. Given the strong performance of LLMs in recent information extraction tasks, a more comprehensive discussion in the manuscript and an experimental comparison are expected (even for a small part of the datasets). At the same time, some RE results (<10%) are too weak to support strong conclusions about end-to-end pipeline performance (e.g., versus an LLM approach).

Therefore, this issue should be further clarified and strengthened before the manuscript can be considered for publication. For this reason, I recommend Major Revision.

Review #3

Anonymous submitted on 27/May/2026

Suggestion:
Minor Revision

Review Comment:

This paper addresses a relevant problem in the CH domain, namely the automatic extraction and creation of linked data descriptions under low-resource conditions. The authors evaluate information extraction (NER and RE) models on non-English retrodigitized documents with OCR noise, aligning extracted information with ArchOnto. In this revised version, fine-tuned BiLSTM-CRF models are compared against zero-shot transformer-based models in different settings.

Strengths
The problem addressed is particularly relevant for GLAM institutions, which usually operate with limited computational resources and challenging documents (non-annotated, non-English, noisy).
The paper is well-structured and the pipeline is clearly described. Compared to the previous version, this revision shows improvements: results are reported more extensively, the dataset creation process is transparent, and models of different architectures and sizes are evaluated.

Weaknesses
The main limitation of this paper (acknowledged by the authors) is the limited size and composition of the domain-specific evaluation data. The Ner.spec.OCR and Ner.spec.Human datasets comprise only 13 documents and a limited set of entity types and relations, which makes the results hard to generalize. Additionally, the RE task lacks a baseline, since only the GLiREL model is evaluated due to insufficient training data. As such, the RE-related aspects of RQ1, RQ4 and RQ5 are not fully explored and the RQs might need to be reshaped. Finally, I would like to underscore that the choice of GLiNER’s entity labels (e.g., replacing Group with Organization) may have influenced the model’s predictions and results.

Minor concerns
- The list of contributions in the Introduction skips contribution 2).
- The future work section would benefit from greater specificity.
- Typos in Sec 6.2.1 and 6.2.2: words are missing a final 's' in a few places (e.g. 'we evaluated the model…', 'of extracting relation between…').
