Abstract:
Linked Data enables cultural heritage institutions to refine archival descriptions and improve access, but creating such descriptions manually remains labor-intensive. Automating information extraction from digitized records to generate Linked Data descriptions addresses this problem. However, complex or resource-intensive information extraction pipelines are often impractical for resource-constrained archival institutions. This paper identifies and evaluates the most effective methods for ontology-aligned information extraction from Portuguese archival collections under low-resource conditions. We compare fine-tuned early sequence labeling architectures with transformer-based zero-shot models for entity and relation extraction aligned with ArchOnto, an ontology for the archival domain that is based on CIDOC-CRM, an ontology for the cultural heritage domain. Models were evaluated on general-domain Portuguese datasets and on domain-specific 20th-century Portuguese archival texts, comprising Optical Character Recognition (OCR) output and the corresponding human-made transcriptions. Our results highlight the challenges of extracting information from noisy archival texts: BiLSTM-CRF-based named entity recognition models achieved solid performance, whereas GLiREL produced poor relation extraction results, which limits reliable ontology-guided triple extraction.