Abstract:
Linked Data (LD) enables cultural heritage institutions to refine archival descriptions and improve findability, but creating LD descriptions manually remains labor-intensive. This paper presents an ontology-guided information extraction system that assists archivists by automatically identifying concepts and relations in digitized archival records. Focusing on Portuguese archival collections, we extract and structure data according to ArchOnto, a CIDOC-CRM-based LD model for archives, to support future metadata enrichment. Our approach identifies core archival entities in the text of archival records, obtained through optical character recognition and human transcription. However, it performs poorly on some entity types and on relational facts. These weak results indicate that fine-tuning information extraction models on adapted general-domain datasets is only marginally viable for Cultural Heritage tasks on 20th-century documents.