Review Comment:
#Summary
This paper investigates the use of LLMs for narrative entity extraction and linking to Wikidata, with a focus on sustainability and value chains in European mountain regions. The authors propose and evaluate three types of approaches, using a dataset of 30 annotated narratives, and compare the performance of nine LLMs against baseline methods. While the models enriched with Wikipedia-based identification and a Jaccard index scoring strategy perform best, overall results remain average, underscoring the difficulty of the task.
#General comments
##Originality, quality, importance and impact
The paper addresses a relevant and challenging problem at the intersection of NLP, KGs, and narrative analysis. However, the proposed methods, largely consisting of openly accessible LLMs and APIs, are not particularly novel.
The most valuable contribution appears to be the dataset focused on European mountain territories and value chains. Unfortunately, this is underemphasised in the manuscript. Given its uniqueness, it could play a more central role in the framing of the paper.
A limitation is generalisability: the dataset contains only 30 narratives, and the LLMs are evaluated solely on this dataset and a single knowledge graph (Wikidata). As a result, it is difficult to assess how well the findings would transfer to other domains, larger datasets, or different KGs.
##Clarity and readability
The paper would benefit from clearer structure and more visual support. Currently, methodology, results, and experimental setup are somewhat intermingled, which affects readability and coherence. One possible structure might be:
1. Introduction
2. Related Work (currently missing)
3. Methodology
- Current section on “Task Definition”
- Overview of approaches
- Some material currently in the Results section might belong here instead, such as the algorithm that extracts Wikidata QIDs via SPARQL queries
- Approach 2 seems to have two sub-approaches (with/without Natomo’s in-context prompting), a distinction that only appears in the Results section
4. Dataset
- Previous section on “Dataset and Gold Standard Corpus”
5. Experimental Setup
- Introduce the metrics (previously in Methodology section)
- Previous “LLM Selection” section
- Previous “Baseline selection” section
6. Results
- Several elements of the Results section are repeated in detail without adding much insight (e.g. detailed descriptions of minor improvements for each of the nine models). Presenting all results in a single comparative table, rather than per approach, would reduce redundancy and improve readability.
7. Discussion
8. Conclusion
It would be helpful to explicitly hypothesise how the different approaches are expected to perform before presenting results. For example, one might expect that each successive approach adds more knowledge base enrichment and thus improves performance.
Lastly, adding visualisations or examples would greatly help:
- The “semi-automatic workflow” (mentioned in the introduction)
- The case study (in the introduction or in the dataset preparation)
- The differences between the approaches
- In this context, an example distinguishing TP, FP, and FN cases (a brief sketch follows this list)
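To make the last point concrete, a toy example of the metric computation could accompany the definitions. The counts below are hypothetical and assume standard precision/recall/F1 over linked QIDs:

```latex
% Hypothetical counts for one narrative: TP = 3 correctly linked QIDs,
% FP = 2 spurious QIDs, FN = 1 missed gold QID
P  = \frac{TP}{TP + FP} = \frac{3}{5} = 0.60, \qquad
R  = \frac{TP}{TP + FN} = \frac{3}{4} = 0.75, \qquad
F_1 = \frac{2PR}{P + R} = \frac{2 \times 0.60 \times 0.75}{1.35} \approx 0.67
```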
##Provided data and sustainability
- Code and dataset are publicly available (GitHub)
- The following could be clarified further:
- Which programming language versions were used (Python/Java)
- A description of the dataset contents and columns (e.g. “objectlinks”, “bar_lon”)
#Detailed comments
##Introduction
- The sentence “narratives are particularly significant in the scientific context (…) and share specialised knowledge” could be elaborated further
- What are the “meaningful relationships” that connect events, could you give examples?
##Methodology
- Could you clarify why the approach is “bottom-up”?
- Consider introducing all approaches here and explaining the rationale for each. Why is the method with the Jaccard index not considered a fourth approach?
##Task Definition
- The section should distinguish more clearly between “entities” and “keywords”. The terms seem to be used interchangeably, which is confusing.
##LLM Selection
- Could you briefly introduce the open-science approach/principles, so that the link between Open Science and the requirements is clearer?
- The description of the hardware used for the experiments is quite vague and may not be informative for readers unfamiliar with technical specifications. It would help to clarify the setup or explain why these specifications were chosen.
- The model selection process ("we decided to select nine models") should be more detailed: what were the criteria? Are these the only nine models that met the inclusion criteria?
- Could you include citations or links to official sources for each model?
##Dataset and Gold Standard Corpus
- A small illustrative example from the dataset would help readers understand its structure and content
- It is unclear who annotated the dataset, what guidelines were used, and whether inter-annotator agreement was assessed
- Why did you choose this “determination formula”, and could you give an intuition for it? It is also unclear how you arrive at 30 narratives from the formula: with Z = 1.96, p = 0.5, and MOE = 0.05 one can compute n0, but what is N? (See the worked example below.)
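For reference, my assumption is that the “determination formula” is Cochran's sample-size formula with a finite-population correction; if so, a worked version along these lines would make the derivation transparent:

```latex
% Cochran's formula (reviewer's assumption, not confirmed by the paper):
n_0 = \frac{Z^2 \, p\,(1 - p)}{\mathrm{MOE}^2}
    = \frac{1.96^2 \times 0.5 \times 0.5}{0.05^2} \approx 384
% Finite-population correction, where N is the total number of narratives:
n = \frac{n_0}{1 + \frac{n_0 - 1}{N}}
```

Under this reading, obtaining n = 30 would imply a population of only roughly N ≈ 32-33 narratives, so stating N explicitly is essential for the sample-size argument to hold.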
##Baseline selection
- Why was TENET discarded? If it combines entity and relation linking, could it not still serve as a reference point?
- “Using Python scripts” can be omitted
- TAGME achieved the highest recall; why was it not retained alongside JSI Wikifier?
##Results
- Rounding all metric values to two decimal places would improve readability
- The claim that LLMs often return “fictional or incorrect QIDs” could be quantified (e.g., proportion of incorrect QIDs).
- Some interpretations of the results (e.g. that “keyword definition helps the model”) should be more cautious when the performance gains are marginal
##Introducing the Jaccard Index to increase the LLM results
- Statements such as “we obtained the best results at threshold X” could be supported by explicit observations: are these per-threshold results reported anywhere? (See the sketch after this list.)
- Again, many of the reported improvements appear to be marginal
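For concreteness, my reading of the Jaccard scoring step is something like the sketch below (a minimal reconstruction; the candidate QIDs, labels, and threshold are hypothetical). If this is roughly right, the paper should state what exactly is tokenised and compared, and report the results obtained at each threshold:

```python
def jaccard(a: str, b: str) -> float:
    """Token-level Jaccard index between two labels (reviewer's reconstruction)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

# Hypothetical example: keep a candidate QID only if its Wikidata label is
# similar enough to the keyword extracted by the LLM.
THRESHOLD = 0.5  # the tuned threshold whose chosen value should be reported
candidates = {"Q1286": "Alps", "Q123456": "value chain"}  # hypothetical candidates
keyword = "alpine value chain"
kept = {qid: label for qid, label in candidates.items()
        if jaccard(keyword, label) >= THRESHOLD}  # -> {"Q123456": "value chain"}
```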
##Discussion
- The dataset annotation is described both as having been done by “experts” and by “we”: could you clarify who the annotators were and their level of expertise?
- The absence of some entities from Wikidata is mentioned and highlights an important limitation of KGs; this could be discussed further as an avenue for future work.
##Conclusion
- The conclusion would be stronger if it were more concise and forward-looking, rather than repeating earlier content.