Ontology-based Information Extraction from Cultural Heritage Digital Representations: A Case Study in Portuguese Archives

Tracking #: 3912-5126

This paper is currently under review
Authors: 
Mariana Dias
Carla Teixeira Lopes

Responsible editor: 
Guest Editors 2025 OD+CH

Submission type: 
Tool/System Report
Abstract: 
Linked Data (LD) enables cultural heritage institutions to refine archival descriptions and improve findability, but creating LD descriptions manually remains labor-intensive. This paper presents an ontology-guided information extraction system that assists archivists by automatically identifying concepts and relations in digitized archival records. Focusing on Portuguese archival collections, we extract and structure data according to ArchOnto, a CIDOC-CRM-based LD model for archives, to support future metadata enrichment. Our approach identifies core archival entities in textual digital representations of archival records obtained through optical character recognition and human transcription. However, it achieves only limited performance on some entity types and on relational facts. These weak results indicate that fine-tuning information extraction models on general-domain datasets adapted to Cultural Heritage tasks is only marginally viable for 20th-century documents.