Review Comment:
The submitted manuscript studies whether recent large language models (LLMs) can be exploited for (a) mapping archival metadata schemas to standard ontologies and (b) suggesting vocabulary re-use across domains. The paper builds on a prior, expert-validated mapping between the Swedish National Archives (RA) schema and the Records in Contexts ontology (RiC-O). Two experiments are conducted.
In the first experiment, a mapping from a custom archival schema to RiC-O is attempted using GPT-4o and GPT-4.5. This is evaluated under a “zero-shot” setup (no additional information in the prompt) and a more “informed” setup, in which an expert-provided mapping is supplied to the LLMs as an example. The output is evaluated and tagged by human experts. Precision, recall, and F1-score are then computed from the experts' assessments of the individual mappings (TP/FP/TN/FN). The informed setup yields, as expected, better results, at approximately 65% accuracy.
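For clarity, the metric computation described here can be sketched as follows; the counts are purely illustrative and are not the manuscript's reported figures:

```python
def prf1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Compute precision, recall, and F1 from confusion counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Illustrative counts only -- not the manuscript's actual numbers.
p, r, f = prf1(tp=40, fp=15, fn=10)
print(round(p, 3), round(r, 3), round(f, 3))  # 0.727 0.8 0.762
```

Making the exact counts and this computation available alongside the data would let readers verify the reported scores directly.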
In the second experiment, the LLM is asked to recommend vocabularies for re-use, both from cultural heritage settings and from other domains. Again, the LLM's output is evaluated by human experts, and the results are tabulated and visualized.
The experiments are followed by a discussion of the limitations of LLM-based mapping. Here, focus is placed on the rate of hallucinated properties (deemed to be high) and on the fact that in many cases the LLM does not produce any true negatives. While accuracy, precision, and F1-scores are moderate, the authors conclude that the LLM can provide useful initial material for human experts to work on. Limitations of the proposed approach are also discussed, including the size of the context window and the design of prompts; these are highlighted as future work.
The topic of the paper, i.e. the use of LLMs for ontology alignment in archival metadata, is well within the scope of the journal and timely. It also addresses an existing challenge of the semantics community with both theoretical and practical implications, namely applying FAIR principles and practices to archival data. The authors survey the existing literature on ontology mapping and archival linked data in a satisfactory fashion. The experiments provide interesting insight into the capability of LLMs to assist in these tasks. However, it would be highly desirable to provide comparative results against other existing tools, in order to give a more comprehensive view of the performance of the LLM-based approach and its benefits and shortcomings. Tools that could be considered in this direction include AML (https://github.com/AgreementMakerLight/AML-Project) and LogMap (https://github.com/ernestojimenezruiz/logmap-matcher).
Using multiple datasets in the evaluation experiments would strengthen the representativeness and generalizability of the results. The authors should consider adding larger datasets as supplemental ones, since the ones presented consist of approximately 70 elements, which is a modest size. More complete information is also needed on the number of examples provided in experiment #2, and a discussion of how the examples provided affected the outcome of the experiment would be rather useful.
The authors provide the data in appendices, contributing to open data and open science. It would be better, however, to store these data in a more processable form (separate, appropriately formatted, machine-processable files) in a public, stable repository (e.g. Zenodo or Figshare). Any code used in the experiments would be a welcome addition to the dataset.
Using an openly available LLM would also increase the transparency and stability of the results. Since this is not a trivial task, it is recommended for consideration in future work; Llama 3 is a potential candidate in this direction.
Along the same lines, using and assessing alternative prompt formulations would be an important aspect. The authors note this direction as future work; however, some preliminary insight here would already be a welcome addition.
LLMs are known to include some level of randomness in their replies. Repeating the experiments multiple times and expanding the evaluation to account for the stochastic aspects of the responses is strongly recommended.
The analysis could also be expanded to include a more detailed classification of errors, e.g. wrong class vs. wrong relation. Currently, such a classification is applied only to hallucinations.
The organization of the paper is adequate. The manuscript needs careful proofreading to improve the quality of English, including breaking long sentences into shorter ones. If the data and results are moved to a stable online repository as suggested above, the manuscript text could focus on a few important/interesting examples, making it shorter and more focused.
Minor comments:
The statement “no suggestion was given and no suggestion could be valid” could be changed to “no suggestion was given and human experts did not provide any suggestions either”.
Figure 7 is blurry and should be replaced with a crisper version.
The formatting of the paper needs to be improved in terms of consistency, and some tables should be presented in landscape orientation to improve readability.