Integrating Wikidata Entities into Narrative Graphs Using Large Language Models

Tracking #: 3889-5103

Authors: 
Emanuele Lenzi
Valentina Bartalesi

Responsible editor: 
Mehwish Alam

Submission type: 
Full Paper
Abstract: 
Narratives are essential tools for articulating and sharing human experiences, particularly in scientific and cultural domains where they aid in the explanation of complex phenomena. In the context of a broader scientific effort in which Knowledge Representation and Semantic Web technologies are used to transform raw textual data into formal narratives, this paper explores the capability of Large Language Models (LLMs) to associate narrative entities (e.g. persons, locations, keywords) with the corresponding entity identifiers from Wikidata. We propose three LLM-based approaches for extracting and linking narrative entities and compare their performance against the JSI Wikifier, a state-of-the-art entity linking tool. The evaluation is based on a dataset from the H2020 MOVING project, which focuses on sustainability and value chains in European mountain regions. This study highlights the potential of LLMs to improve semantic annotation workflows and contribute to the automated generation of semantically enriched narratives.

Decision/Status: 
Major Revision

Solicited Reviews:
Review #1
By Ines Blin submitted on 30/Jun/2025
Suggestion:
Major Revision
Review Comment:

#Summary

This paper investigates the use of LLMs for narrative entity extraction and linking to Wikidata, with a focus on sustainability and value chains in European mountain regions. The authors propose and evaluate three types of approaches, using a dataset of 30 annotated narratives, and compare the performance of nine LLMs against baseline methods. While the models enriched with Wikipedia-based identification and a Jaccard index scoring strategy perform best, overall results remain average, underscoring the difficulty of the task.

#General comments

##Originality, quality, importance and impact

The paper addresses a relevant and challenging problem at the intersection of NLP, KGs, and narrative analysis. However, the proposed methods, largely consisting of openly accessible LLMs and APIs, are not particularly novel.

The most valuable contribution appears to be the dataset focused on European mountain territories and value chains. Unfortunately, this is underemphasised in the manuscript. Given its uniqueness, it could play a more central role in the framing of the paper.

A limitation is generalisability: the dataset contains only 30 narratives, and the LLMs are evaluated solely on this dataset and a single knowledge graph (Wikidata). As a result, it is difficult to assess how well the findings would transfer to other domains, larger datasets, or different KGs.

##Clarity and readability

The paper would benefit from clearer structure and more visual support. Currently, methodology, results, and experimental setup are somewhat intermingled, which affects readability and coherence. One possible structure might be:

1. Introduction
2. Related Work (currently missing)
3. Methodology
- Current section on “Task Definition”
- Overview of approaches
- Some paragraphs of the Results section should probably be moved here, such as the algorithm that extracts Wikidata QIDs with SPARQL queries
- Approach 2 seems to have two sub-approaches (with/without Natomo’s in-context prompting), a distinction that only appears in the Results section
4. Dataset
- Previous section on “Dataset and Gold Standard Corpus”
5. Experimental Setup
- Introduce the metrics (previously in Methodology section)
- Previous “LLM Selection” section
- Previous “Baseline selection” section
6. Results
- Several elements of the Results section are repeated in detail without adding much insight (e.g. detailed descriptions of minor improvements for each of the nine models). Presenting all results in a single comparative table, rather than per approach, would reduce redundancy and improve readability.
7. Discussion
8. Conclusion

It would be helpful to explicitly hypothesise how the different approaches are expected to perform before presenting results. For example, one might expect that each successive approach adds more knowledge base enrichment and thus improves performance.

Lastly, adding visualisations or examples would greatly help:

- The “semi-automatic workflow” (mentioned in introduction)
- The case study (in the introduction or in the dataset preparation)
- The differences between the approaches
- In this context, a worked example distinguishing TP/FP/FN cases

##Provided data and sustainability

- Code and dataset are publicly available (GitHub)
- The following could be clarified further:
  - Which programming language versions were used (Python/Java)
  - A description of the dataset contents and columns (e.g. “objectlinks”, “bar_lon”)

#Detailed comments

##Introduction

- The sentence “narratives are particularly significant in the scientific context (…) and share specialised knowledge” could be elaborated on further
- What are the “meaningful relationships” that connect events? Could you give examples?

##Methodology

- Could you clarify why the approach is “bottom-up”?
- Consider introducing all approaches here and explaining the rationale for each. Why is the method with the Jaccard index not considered a fourth approach?

##Task Definition

- The section should distinguish more clearly between “entities” and “keywords”. The terms seem to be used interchangeably, which is confusing.

##LLM Selection

- Could you briefly introduce the Open Science approach/principles, so that the link between Open Science and the requirements is clearer? The description of the hardware used for the experiments is quite vague, and may not be informative for readers unfamiliar with technical specifications. It would help to clarify the setup or explain why these specifications were chosen.
- The model selection process ("we decided to select nine models") should be described in more detail: what were the criteria? Are these the only nine models that met the inclusion criteria?
- Could you include citations or links to official sources for each model?

##Dataset and Gold Standard Corpus

- A small illustrative example from the dataset would help readers understand its structure and content
- It is unclear who annotated the dataset, what guidelines were used, and whether inter-annotator agreement was assessed
- Why did you choose this “determination formula”, and could you give an intuition for it? It is also not clear how the formula yields 30 narratives: with Z=1.96, p=0.5, and MOE=0.05 we can compute n0, but what is N? (See the worked formula after this list.)
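
For reference, a worked version of the computation this question refers to, assuming the paper relies on the standard Cochran formula with finite-population correction (an assumption on my part, since the exact formula is not quoted here):

\[
n_0 = \frac{Z^2 \, p(1-p)}{\mathrm{MOE}^2} \approx \frac{1.96^2 \cdot 0.25}{0.05^2} \approx 384,
\qquad
n = \frac{n_0}{1 + (n_0 - 1)/N}
\]

Here N would be the size of the full narrative population; stating N explicitly would let readers verify how n = 30 is obtained.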

##Baseline selection

- Why was TENET discarded? If it combines entity and relation linking, could it not still serve as a reference point?
- “Using Python scripts” can be omitted
- TAGME achieved the highest recall—why was it not retained alongside JSI Wikifier?

##Results

- Rounding all metric values to two decimal places would improve readability
- The claim that LLMs often return “fictional or incorrect QIDs” could be quantified (e.g., the proportion of incorrect QIDs; one possible check is sketched after this list).
- Some interpretation of results (e.g. that “keyword definition helps the model”) should be more cautious when performance gains are marginal
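
A minimal sketch of how such a quantification could be carried out (an illustration of one possible check, not the authors' pipeline; the example predictions are hypothetical): each QID returned by an LLM is looked up in Wikidata to verify that it exists and that its label or aliases match the predicted entity.

```python
# Illustrative sketch: count how many LLM-returned QIDs exist in Wikidata and how
# many carry a label or alias matching the predicted entity (case-insensitive).
import requests

WD_API = "https://www.wikidata.org/w/api.php"

def qid_exists_with_label(qid, expected_label, lang="en"):
    params = {"action": "wbgetentities", "ids": qid,
              "props": "labels|aliases", "languages": lang, "format": "json"}
    data = requests.get(WD_API, params=params,
                        headers={"User-Agent": "qid-check-sketch/0.1"}).json()
    entity = data.get("entities", {}).get(qid, {})
    if not entity or "missing" in entity:
        return False, False  # fictional / non-existent QID
    names = [entity.get("labels", {}).get(lang, {}).get("value", "")]
    names += [a["value"] for a in entity.get("aliases", {}).get(lang, [])]
    return True, expected_label.lower() in (n.lower() for n in names)

# Hypothetical (surface form, QID) pairs as an LLM might return them.
predictions = [("Mont Blanc", "Q583"), ("value chain", "Q123456789")]
existing = matching = 0
for label, qid in predictions:
    exists, matches = qid_exists_with_label(qid, label)
    existing += exists
    matching += matches
print(f"{existing}/{len(predictions)} QIDs exist, {matching}/{len(predictions)} match the label")
```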

##Introducing the Jaccard Index to increase the LLM results

- Statements such as “we obtained the best results at threshold X” should be backed by explicit observations: are these results reported anywhere? (A sketch of the thresholding I have in mind follows this list.)
- Again, many improvements appear to be minor
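
For concreteness, a minimal sketch of the kind of Jaccard-based filtering the section seems to describe (my reading of the approach, not the authors' exact procedure): the predicted label and a candidate Wikidata label are tokenised, and the candidate is kept only if the Jaccard index of the two token sets reaches a threshold.

```python
# Illustrative token-level Jaccard filtering between a predicted entity label and a
# candidate Wikidata label; the threshold is the free parameter discussed above.
def jaccard(a: str, b: str) -> float:
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)

def keep_candidate(predicted: str, candidate: str, threshold: float = 0.5) -> bool:
    return jaccard(predicted, candidate) >= threshold

print(jaccard("mountain value chain", "value chain"))              # 2/3 ≈ 0.67
print(keep_candidate("mountain value chain", "value chain", 0.7))  # False
```

Reporting results across a grid of thresholds (e.g. in an appendix table) would make claims such as “best results at threshold X” verifiable.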

##Discussion

- The dataset annotation is described both as having been done by “experts” and by “we”: could you clarify who the annotators were and their level of expertise?
- The absence of some entities from Wikidata is mentioned and underlines an important limitation of KGs; it could be explored further as future work.

##Conclusion

- The conclusion would be stronger if it were more concise and forward-looking, rather than repeating earlier content.

Review #2
Anonymous submitted on 27/Aug/2025
Suggestion:
Major Revision
Review Comment:

The paper presents a method for entity linking that leverages Large Language Models (LLMs) in a domain-specific context (sustainability and value chains), with a focus on resource-constrained scenarios: the study evaluates small, open-source LLMs that can run on limited hardware. A subset of the H2020 MOVING dataset is used as a case study, where textual narratives describing events are linked to Wikidata entities (persons, locations, concepts). The authors test three prompting strategies for retrieving Wikidata QIDs: (1) direct linking, (2) identifying entities and retrieving QIDs via the Wikidata SPARQL endpoint, and (3) detecting entities together with their Wikipedia titles and retrieving QIDs using the Wikipedia APIs. The evaluation compares nine lightweight, open-source LLMs.
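
To make the two lookup-based strategies concrete, the following is a minimal sketch of how they could be realised against the public endpoints (an illustration based on my reading of the paper, not the authors' prompts or code; the example labels are hypothetical):

```python
# Illustrative sketch of strategy (2), label lookup via the Wikidata SPARQL endpoint,
# and strategy (3), resolving a Wikipedia title (following redirects) to its QID.
import requests

WIKIDATA_SPARQL = "https://query.wikidata.org/sparql"
WIKIPEDIA_API = "https://en.wikipedia.org/w/api.php"
HEADERS = {"User-Agent": "narrative-linking-sketch/0.1"}

def qids_by_label(label, lang="en", limit=5):
    """Strategy (2): candidate QIDs whose label matches the entity string exactly."""
    query = f'SELECT ?item WHERE {{ ?item rdfs:label "{label}"@{lang} }} LIMIT {limit}'
    resp = requests.get(WIKIDATA_SPARQL, params={"query": query, "format": "json"},
                        headers=HEADERS)
    resp.raise_for_status()
    bindings = resp.json()["results"]["bindings"]
    return [b["item"]["value"].rsplit("/", 1)[-1] for b in bindings]

def qid_from_wikipedia_title(title):
    """Strategy (3): resolve a Wikipedia page title to its linked Wikidata item."""
    params = {"action": "query", "titles": title, "prop": "pageprops",
              "ppprop": "wikibase_item", "redirects": 1, "format": "json"}
    resp = requests.get(WIKIPEDIA_API, params=params, headers=HEADERS)
    resp.raise_for_status()
    for page in resp.json()["query"]["pages"].values():
        return page.get("pageprops", {}).get("wikibase_item")
    return None

print(qids_by_label("Mont Blanc"))             # e.g. ['Q...']
print(qid_from_wikipedia_title("Mont Blanc"))  # e.g. 'Q...'
```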

The paper addresses an interesting problem — performing entity linking with limited computational resources — and makes an initial attempt at benchmarking lightweight, open-source LLMs in this area. However, several issues limit its contribution in the current form.
The methodological description lacks sufficient detail: although the study emphasizes different prompting strategies (direct linking, SPARQL lookup, Wikipedia redirects), the actual prompts are not reported in the text. While the code is available, explicit reporting of prompts in the paper is essential.

The paper does not provide a thorough discussion of the limitations of current state-of-the-art techniques. ReLiK, GENRE, and TENET are excluded on the grounds that they cannot handle both keywords and named entities, but this justification remains high-level and is not illustrated with concrete examples, leaving the analysis of existing methods incomplete.

Regarding baselines, although JSI Wikifier, TAGME, and Falcon 2.0 are evaluated, the analysis is limited to reporting their scores. The rationale for ultimately selecting JSI Wikifier as the baseline would benefit from a deeper discussion of why its characteristics make it the most appropriate comparator for the proposed methods.

Moreover, although the authors attribute the low performance to the challenging nature of the data, the ground truth dataset itself shows quality issues, such as typos and inconsistencies, which undermine the reliability of the evaluation. Finally, the paper provides only illustrative error cases and lacks a structured analysis that would systematically quantify and categorize these issues. Overall, a major revision is required before publication.

Strengths

- The paper focuses on resource-constrained settings, evaluating and comparing nine small open-source LLMs.

- The paper introduces a domain-specific case study and compares three different strategies (direct linking, SPARQL lookup, Wikipedia redirects).

Weaknesses

- The absence of a dedicated related work section limits the clarity of how the contribution relates to prior work.

- The discussion of baselines is limited: while JSI Wikifier, TAGME, and Falcon 2.0 are compared, the analysis does not extend beyond reporting scores, and the implications of their differences for the task are not explored in depth.

- Prompt design, which is central to the proposed method, is not reported, undermining reproducibility and preventing readers from assessing the design choices.

- No structured error analysis is provided, as domain-specific challenges are discussed only qualitatively and illustrated through a few selective examples, without systematic evaluation.

Minor issues

Typo: “Falco 2.0” in Table 2 (presumably “Falcon 2.0”).

Review #3
Anonymous submitted on 07/Sep/2025
Suggestion:
Reject
Review Comment:

This paper presents an exploration of how open-source LLMs can be integrated into semantic web workflows for entity linking in narrative graphs.
The authors test LLMs in combination with Wikidata SPARQL queries and Wikipedia APIs, in a layered way, showing the incremental contribution of these tools on top of the small LLMs.

The work is rather original, as it is the first application of these technologies to narrative graphs. The stepwise refinement also shows the weaknesses of naïve prompting. However, these findings are not completely new and are somewhat expected, so the innovation value is moderate.

Regarding the results obtained, the dataset is limited in size (30 narratives with 330 entities). I understand that the justification based on the sample size determination formula indicates that this subset is sufficient to reproduce the results expected on the larger sample of 454 narratives. However, the provided justification is exclusively statistical; it does not take into account the linguistic variability of the descriptions. Also, despite the indication of three clear groups/types of narratives, it is not stated whether the subset of 30 narratives includes representatives from all groups.

Looking at the results, and considering the 0.05 margin of error implied by this selection, it is clear that in most tables the F1 scores of the top 3-4 systems are not statistically different, which limits the interpretability of the results. Some results are quite surprising, such as 9B models exceeding the performance of larger ones. A plot relating model size to results would have been useful.

The lack of a comparison with closed models (e.g. GPT-n) makes it impossible to appreciate how far these results are from those of larger, commercial models.

The annotation was carried out by a single expert; at least two annotators would be recommended for tasks of this kind.

Note also that the formatting does not seem to follow the SWJ style.

In conclusion, given the drawbacks listed above, the paper would need a major overhaul, with additional experiments and re-annotation of the dataset. This seems too much for a revision, so I propose rejection.