Using LLMs for Semantic Alignment: A Study on Archival Metadata Description

Tracking #: 3843-5057

Authors: 
Maria Ioanna Maratsi
Charalampos Alexopoulos
Yannis Charalabidis

Responsible editor: 
Guest Editors 2025 OD+CH

Submission type: 
Full Paper
Abstract: 
The advantages of aligning custom data schemas with standardised ontologies within their respective knowledge domain have long been proven in practice. Sharing a common structural representation by mapping concepts and relationships between the schemas is essential to ensure data interoperability (especially on a semantic level), integration, reuse, and the ability to leverage machine-processable and advanced-search capabilities. Archival institutions preserve, manage, and provide access to large amounts of diverse cultural and historical data, demonstrating a high potential to be active contributors to a global knowledge network, should archival data be transformed and offered as linked (open) data. Based on the expert-validated dataset of the mapping (alignment) of the Swedish National Archives schema to the Records in Contexts ontology (RiC-O), the purpose of this study is two-fold. First, to examine whether it is possible to automatically and effectively extend one case (Sweden) to other archival institutions and align new custom schemas to RiC-O, given an expert-curated dataset of this domain. Secondly, using the aforementioned dataset and an additional one containing a few human-evaluated examples of mappings to other cultural heritage ontologies as input, to examine whether an LLM (e.g., GPT-4o) is capable of recommending meaningful alignments for enhanced metadata description to more ontologies within the same domain (CH and archives), but also across other domains. The experiments reveal several challenges and shortcomings of the LLM prompting approach for these tasks, but also possible opportunities to be leveraged in this direction.
Tags: 
Reviewed

Decision/Status: 
Major Revision

Solicited Reviews:
Review #1
Anonymous submitted on 03/Jun/2025
Suggestion:
Major Revision
Review Comment:

The manuscript presents an analysis of GPT-4o's effectiveness in ontology alignment, using a case study involving the Swedish archives.

While this kind of study is relevant, GPT-4o may not constitute the best choice, given the presence of options with better chances of reproducibility. From my perspective, the authors might be aiming to demonstrate LLMs' general capabilities by testing GPT-4o.

Note that according to the journal guidelines for authors, any endeavor involving LLMs and generative algorithms should evaluate the energy consumption and carbon emissions of the proposed method, as well as the development effort it entailed. I haven't seen this information in the paper.

At first it seemed that the study was based on a previously provided alignment, considered here as a ground truth, as stated in the Method section. However, in the Evaluation section the authors mention: "The outputs of experiment 1 were initially human-evaluated by the authors, who were considered to be domain experts in this case". Were the outputs not compared to the previously given alignment? Was the previous alignment used as groundwork or as a ground truth? This is not clear.

Also, the previous alignment seems suited to the first experiment, but this does not seem to be the case for the second one with LOV.
About this second experiment, the authors mention in the Introduction that the previous alignment served "to assist in recommending alignments with other schemas of the Cultural Heritage", whereas later, in the Evaluation section, the observation is "It is meaningless to judge the result in the same way as Experiment 1".

I think these issues should all be clarified in the Introduction, Method, and Evaluation sections.

The results are given in terms of recall, precision, and F1 (R, P, F). Tables with examples are presented. I consider the examples valuable; however, in the given presentation format they are difficult to read. Perhaps they should be presented in landscape orientation to facilitate reading.

The reader might deduce the size of the test datasets from the confusion matrix, but the lack of a clear presentation imposes this work on them. Experiment 2 is said to include a total of 69 cases. A better-organised description of the data used in the experiments would help the reader.

Also regarding the data, an understanding of RiC-O is conveyed through the example tables, while RA is only succinctly described. Moreover, the cultural heritage context is insufficiently established.

The presentation of the prompting process could be expanded.

The mappings in Figures 7 and 8 are difficult to interpret; instead of a textual format for their explanation, tables could be used.

Regarding related work, the paper mentions some previous attempts at using LLMs for ontology alignment, but previous findings are not correlated with those of the authors. The similarities and differences between previous studies and the present one are not commented on.

It seems that in areas where the outputs leave little room for freedom of expression, these models, based on general world knowledge and free-language training, cannot perform as well. The kind of study presented in this paper is important for emphasising such limitations. Would RAG techniques be able to improve this?
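For instance, a retrieval-augmented setup could ground the model's suggestions in the ontology documentation itself. A minimal sketch of what this might look like, in Python, is given below; the candidate properties, their definitions, the schema element, and the embedding model are illustrative assumptions on my part, not anything taken from the paper.

# Sketch: a retrieval-augmented prompt for schema-to-ontology alignment.
# Hypothetical illustration; the candidate properties, definitions, and schema
# element are made-up stand-ins, and the embedding model is assumed to be
# OpenAI's text-embedding-3-small.
import numpy as np
from openai import OpenAI

client = OpenAI()

candidate_properties = [
    ("rico:title", "A word, phrase, or group of characters naming the record resource."),
    ("rico:scopeAndContent", "Information on the scope and content of the record resource."),
    ("rico:history", "Information on the history of the record resource."),
]

def embed(texts):
    """Return L2-normalised embedding vectors for a list of texts."""
    result = client.embeddings.create(model="text-embedding-3-small", input=texts)
    vectors = np.array([item.embedding for item in result.data])
    return vectors / np.linalg.norm(vectors, axis=1, keepdims=True)

def build_rag_prompt(schema_element: str, top_k: int = 2) -> str:
    """Retrieve the most similar ontology properties and inject them into the prompt."""
    property_vectors = embed([f"{iri}: {definition}" for iri, definition in candidate_properties])
    query_vector = embed([schema_element])[0]
    scores = property_vectors @ query_vector
    best = np.argsort(scores)[::-1][:top_k]
    context = "\n".join(
        f"- {candidate_properties[i][0]}: {candidate_properties[i][1]}" for i in best
    )
    return (
        f"Candidate RiC-O properties:\n{context}\n\n"
        f"Which candidate best matches the schema element '{schema_element}'? "
        "Answer with one IRI, or 'none' if no candidate fits."
    )

print(build_rag_prompt("ArchivalUnit.title: the title of the archival unit"))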

Review #2
Anonymous submitted on 25/Jul/2025
Suggestion:
Major Revision
Review Comment:

The submitted manuscript studies whether recent large language models (LLMs) can be exploited for (a) mapping archival metadata schemas to standard ontologies and (b) suggesting vocabulary re-use across domains. The paper builds on a prior, expert-validated mapping between the Swedish National Archives (RA) schema and the Records in Contexts ontology (RiC-O). Two experiments are conducted.

In the first experiment, a mapping from a custom archival schema to RiC-O is attempted using GPT-4o and GPT-4.5. This is evaluated under a “zero-shot” approach (no additional information in the prompt) and a more “informed” setup, where an expert-provided mapping is given to the LLMs as an example. The output of the process is evaluated by human experts and tagged accordingly. Precision, recall, and F1-scores are computed based on the human experts' assessment of individual mappings (TP/FP/TN/FN). The informed setup provides (expectedly) better results, approximately 65% accuracy.
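For reference, these scores follow directly from the expert-assigned tags; a minimal sketch in Python (with placeholder counts, not figures from the paper):

# Sketch: precision, recall, and F1 from expert-assigned TP/FP/FN counts.
# The counts below are illustrative placeholders, not the paper's results.
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return precision, recall, f1

print(precision_recall_f1(tp=30, fp=10, fn=15))  # -> approximately (0.75, 0.667, 0.706)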

In the second experiment, the LLM is requested to recommend re-use of vocabularies from cultural heritage settings as well as other domains. Again, the results provided by the LLM are evaluated by human experts, then tabulated and visualized.
The experiments are followed by a discussion of the limitations of LLM-based mapping. In this context, focus is placed on the results concerning hallucinated properties (deemed to be high) and the fact that in many cases the LLM does not suggest any true negatives. While accuracy, precision, and F1-scores are moderate, the authors conclude that the use of the LLM can provide some initial material for human experts to work on. Limitations of the proposed approach are also discussed, including the size of the context window and the design of prompts; these are highlighted as future work.

The topic of the paper, i.e. the utilization of LLMs for performing ontology alignment on archival metadata, is well aligned with the scope of the journal and also timely. Furthermore, it focuses on an existing challenge of the semantics community with theoretical and practical implications, i.e. applying FAIR principles and practices to archival data. The authors survey the existing literature on ontological mapping and archival linked data in a satisfactory fashion. The experiments provide interesting insight into the capability of LLMs to assist in the aforementioned tasks, yet it would be highly desirable to provide comparative results against the performance of other existing tools, in order to give a more comprehensive view of the performance of the LLM-based approach and its benefits and shortcomings. Tools that could be considered in this direction include AML (https://github.com/AgreementMakerLight/AML-Project) and LogMap (https://github.com/ernestojimenezruiz/logmap-matcher).

Using multiple datasets in the evaluation experiments would strengthen the representativeness and generalizability of the results. The authors should consider adding larger datasets as supplemental ones, since the presented ones consist of approximately 70 elements, a modest size. More complete information needs to be given concerning the number of examples provided in experiment #2, and a discussion of how the result of the experiment was affected by those examples would be rather useful.

The authors provide the data in appendices, contributing to open data and open science. It would be best, however, to store these data in a more processable format (separate, appropriately formatted, machine-processable files) in a public, stable repository (e.g. Zenodo or Figshare). Any code used in the context of the experiments would be a welcome addition to the dataset.

The use of a publicly available LLM would also increase the transparency and stability of the results. Taking into account that this is not a trivial task, it is recommended that this be considered in future work. Llama 3 is a potential candidate in this direction.
Along the same lines, using and assessing alternative prompt formulations would be an important aspect. The authors note this direction as future work; however, providing some preliminary insight would be a welcome addition.

LLMs are known to include some level of randomness in their replies. Repeating the experiments multiple times and expanding the evaluation to take into account the stochastic aspects of the responses is strongly recommended.
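By way of illustration, a minimal sketch of such a repetition protocol is given below; it assumes the OpenAI Python SDK, and the prompt and model name are placeholders rather than the authors' actual setup.

# Sketch: probing response variability by repeating the same alignment prompt.
# Assumes the OpenAI Python SDK (v1+) with OPENAI_API_KEY set in the environment;
# the prompt is a hypothetical placeholder, not the authors' actual prompt.
from collections import Counter
from openai import OpenAI

client = OpenAI()

PROMPT = (
    "Map the archival schema element 'ArchivalUnit.title' to the most "
    "appropriate RiC-O property. Answer with the property IRI only."
)

def sample_alignments(n_runs: int = 10, temperature: float = 0.7) -> Counter:
    """Run the same prompt n_runs times and count the distinct suggestions."""
    answers = Counter()
    for _ in range(n_runs):
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": PROMPT}],
            temperature=temperature,
        )
        answers[response.choices[0].message.content.strip()] += 1
    return answers

# Report how often each suggestion appears, as a rough stability indicator.
for suggestion, count in sample_alignments().most_common():
    print(f"{count:2d}x {suggestion}")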

The analysis process could be expanded to include a more detailed classification of errors, e.g. wrong class, wrong relation. Currently, this is only applied to hallucinations.

The organization of the paper is adequate. The manuscript needs to be carefully proofread to improve the quality of English, including breaking down long sentences into shorter ones. If the data and results are moved to a stable online repository as suggested above, the manuscript text could focus on some important/interesting examples, making the text shorter and more focused.

Minor comments:

The statement “no suggestion was given and no suggestion could be valid” could be changed to “no suggestion was given and human experts did not provide any suggestions either”.

Figure 7 is blurry and should be replaced with a crisper version.

The paper needs to be improved in terms of formatting consistency, and some tables should be presented in landscape mode to increase readability.

Review #3
Anonymous submitted on 10/Sep/2025
Suggestion:
Reject
Review Comment:

The paper presents an evaluation of LLM-based semantic alignment. The authors use an LLM (GPT-4o) in two scenarios where the injected context varies.

Strengths

* The research scenario addresses a trending area: the use of LLMs (e.g., GPT-4o, GPT-4.5) to support decision-making by injecting contextual knowledge.
* The general problem of semantic alignment in archives and knowledge management is relevant and timely.

Weaknesses

* Manuscript formatting
+ The manuscript does not follow the expected SAGE layout for submissions.
+ Two different templates/layouts appear (up to page 9, and from page 10 onward, recurring at pages 15, 22, etc.).
+ References, tables, and figures are not presented in the required format.
* Supplementary materials
+ The supplementary resources do not contribute directly to the manuscript.
+ It is highly recommended that the appendix and external resources be deposited in an open archive (e.g., Zenodo, GitHub) in the appropriate format.
* Compliance with CFP requirements: The manuscript does not comply with the "Generative Artificial Intelligence carbon footprint" requirement, explicitly requested in the call for papers.
* Stylistic and consistency issues
+ In Section 2 (Background) and Section 3.1 (Prior Knowledge), citation styles are inconsistent: "Research Claim by Author et al. (Year) [##]" vs. "Research Claim [##]". A single, consistent style is necessary for cohesion.
+ Reference 4 is empty.
* Typos and wording issues:
+ "archival contexts and connectability" → connectivity
+ "The promp interaction" → prompt
+ "During the 0-shot" → zero-shot

Recommendation

Reject in the current state.

The manuscript is relevant but does not meet formatting, style, and compliance requirements. A full restructuring, consistent use of references, correction of typographical errors, and compliance with the CFP are required.