Evaluating Large Language Models for RDF Knowledge Graph Related Tasks - The LLM-KG-Bench-Framework 3

Tracking #: 3869-5083

Authors: 
Lars-Peter Meyer
Johannes Frey
Felix Brei
Desiree Heim
Sabine Gründer-Fahrer
Sara Todorovikj
Claus Stadler
Markus Schröder
Natanael Arndt
Michael Martin

Responsible editor: 
Guest Editors 2025 LLM GenAI KGs

Submission type: 
Full Paper
Abstract: 
Current Large Language Models (LLMs) can work with structured information and even assist in developing program code, but can they support working with Knowledge Graphs (KGs) as well? Which LLM offers the best capabilities in the field of Semantic Web and Knowledge Graph Engineering (KGE)? Is it possible to determine this without checking many answers manually? The LLM-KG-Bench framework is designed to answer these questions. It consists of an extensible set of tasks for which the LLM answers are automatically evaluated, and covers different aspects of working with semantic technologies. This article gives a description of the LLM-KG-Bench framework, its main concepts and the tasks implemented. In a benchmark run, a comprehensive dataset has been generated with it, evaluating more than 40 contemporary open and proprietary LLMs. Finally, this dataset is used for an analysis of the SPARQL-related capabilities of the LLMs tested.
Tags: 
Reviewed

Decision/Status: 
Minor Revision

Solicited Reviews:
Review #1
Anonymous submitted on 18/Jul/2025
Suggestion:
Minor Revision
Review Comment:

This paper presents LLM-KG-Bench, a benchmarking framework aimed at systematically evaluating large language models (LLMs) on tasks relevant to RDF and SPARQL. The work is timely and highly relevant, addressing an underexplored space at the intersection of LLMs and semantic web technologies. While many benchmark suites for LLMs exist, these typically focus on general NLP or programming tasks. In contrast, this framework offers an evaluation suite tailored specifically to knowledge graph engineering, including tasks like SPARQL query formulation, RDF format understanding, and related syntactic and semantic challenges.

Originality
-----------

The paper is part of an ongoing line of work by the same authors and builds directly on multiple prior publications. This version consolidates earlier efforts, refines several tasks, and introduces new evaluations across more than 40 LLMs. While this consolidation has value, the degree of innovation relative to earlier versions is incremental rather than foundational. Still, the benchmark fills an important gap in the current ecosystem, and its thorough engineering and empirical results make it a valuable resource for future work.

Significance
------------

The paper is strongest in its comprehensive coverage of model evaluations and in its systematic characterization of task categories. Section 3.6 (General Task Characterization), which discusses the aspects each task is intended to evaluate, is particularly well done and helps the reader understand both the breadth and the limitations of the framework. The inclusion of both open-weight and proprietary models, along with capability plots and performance breakdowns, further adds to the paper’s value.

One important limitation is that the framework does not evaluate LLMs in interaction with a real graph database. In most practical knowledge graph engineering settings, querying or manipulating a KG requires interfacing with a backend system such as a SPARQL endpoint or a triplestore. By abstracting away this interaction and focusing solely on LLMs’ standalone ability to parse or generate RDF or SPARQL-related content, the benchmark risks overestimating their applicability in real-world settings. Some tasks, such as RdfFriendCount or Sparql2AnswerList, ask LLMs to perform tasks that deterministic SPARQL engines already handle more reliably. This raises the question of whether such tasks reflect meaningful use cases for LLMs, or whether they conflate evaluation of reasoning capabilities with unnecessary redundancy. In contrast, tasks where the LLM serves as an interface to formal languages (e.g. translating between text and SPARQL, or between different RDF serialization formats) seem more aligned with practical applications. I would strongly advise the authors to consider this when introducing the scope of the framework and its potential limitations.

Quality of writing
------------------

The paper is generally clear and well-organized, with helpful diagrams and explanations. There are some areas in the writing that could benefit from clarification:
- Page 4: "an improvement of the analysis of TTL versus JSON..." After reading the paper, it becomes clear that this is about the effect of format used by the LLM, but in a first read at this section it is not clear what this means.
- Page 4: "with scores for IfEval, BBH, MATH, ..." these acronyms are not clear and there is no citation or further explanation of what these benchmarks are, or why they are relevant in the discussion.
- Page 10: "Benchmark tasks in LLM-KG-Bench do implement the interface AbstractLlmKgBenchTaskInterface which enables a rough compatibility with the BigBench task classes" What's the point of this interface? Why is it important to enable rough compatibility with the BigBench task classes?

Artifacts
---------

On the resource side, the GitHub repository is accessible and contains useful documentation, although some of the data used in the experiments is only available upon request. This unfortunately limits the extent to which other researchers can replicate the results out of the box.

Conclusion
----------

Overall, this is a solid paper with a potentially impactful contribution to the semantic web. I'm a bit concerned about the incremental differences across the series of papers that this paper extends, but I lean towards acceptance nonetheless thanks to the additional context of section 3.6 and new empirical results. There are some clarifications that are needed, as detailed above, so I recommend a minor revision.

Review #2
By Michael Röder submitted on 24/Jul/2025
Suggestion:
Minor Revision
Review Comment:

# Publication Summary

The paper presents the current state of the benchmarking framework LLM-KG-Bench, which has been created for automatically assessing and comparing Large Language Models (LLMs) with respect to their capabilities to work with Semantic Web technologies. Within this work, the authors focus on evaluations related to the LLMs' capabilities to process and generate SPARQL SELECT and RDF data. The framework is described in detail in Section 3. The authors show the usefulness of the evaluation framework in Section 4 by comparing the performance of 41 LLMs in several tasks and draw conclusions from these results, e.g., which RDF format certain LLMs prefer.

This paper is an extended version of Lars-Peter Meyer, Johannes Frey, Desiree Heim, Felix Brei, Claus Stadler, Kurt Junghanns, and Michael Martin: "LLM-KG-Bench 3.0: A Compass for Semantic Technology Capabilities in the Ocean of LLMs" published at the ESWC 2025. In comparison to the previous publications, the authors increased the number of LLMs that they evaluate and enhanced the analysis of the evaluation results.

# Review Summary

## Originality

There are several works that look at the performance of LLMs on tasks related to knowledge graphs. The submitted work itself lists several related articles that evaluate LLMs in similar ways. However, it seems like the presented work provides a large set of different SPARQL- and RDF-related tasks, includes connectors to a large number of LLMs and offers automatic evaluations. The latter point is especially important as manual or crowd-based evaluations are costly.

## Significance of the Results

The authors present some significant insights. They are not only able to compare the performance of the evaluated LLMs for a single task but show that based on their framework further insights can be gathered. It is also pointed out that all evaluation data is collected and made available for further analysis. In addition, intermediate results (i.e., the answers of the LLMs) are stored by the framework and can be evaluated again in case further analyses are implemented.

## Quality of Writing

The weakest part of the paper seems to be its presentation. A paper should be self-contained (I know that this concept has limitations). However, this paper builds upon previous works in a way that it is nearly impossible to understand it without having read the previously published papers about the framework, the tasks and especially the scores that are used in the evaluation. I think this needs to be improved. Especially the points raised in the following need the authors' attention:

- The introduction clearly states that it is an extension of [4]. However, the paper itself refers to [4] several times, which looks very strange to me as I would expect that the extension includes everything that [4] contains.
-- The sentence "For SSF and T2S we selected the max_combined score, which is 0.2 for syntactically correct but wrong queries." is a good example that the paper is missing critical information. The max_combined score is not explained, but used in Figure 5. The same holds for the combinedF1 score, which is also not explained. Since these scores do not seem to be very common (except one has read all the related work), it is necessary to briefly explain them.
- Section 3.1 seems to have an unfortunate structure at the moment. It starts by presenting Figure 1, but instead of explaining it, the text directly continues to explain concepts, which are not shown in Figure 1 but in Figure 2. It would be very important to give the reader more guidance. I can imagine two solutions: 1) Start with an overview and explain it; then dive deeper into the framework and explain specific parts like the Prompt-Answer-Evaluate loop. OR 2) Start with an important core concept, like the Prompt-Answer-Evaluate loop, and then explain the parts of the evaluation framework around it that enable the loop to do its work.
-- "Task Case Entries" (p 8) are mentioned but not explained in this paragraph. Instead, they are explained on page 10.
-- The names of the concepts in 3.1 and Figures 1 and 2 are not well aligned. "Task Evaluation Iterations" vs. "task evaluation execution"; "Task Executions" vs. "task iteration"; "List Tasks" vs. "taskList" vs. "Task Collection"(?)
-- "Result Reevaluation" does not seem to exist in Figure 1. Or is it optional?
- I would suggest that parts of the text should be reformulated to be more precise as some formulations are either not helpful or not correct.
-- "Moreover, what is generally amiss is a benchmark execution framework that helps to deal with the particularities of RDF and KG-related workloads (format parsing, syntax check feed back loops, execution and evaluation of queries towards KGs, etc.)." --> I am surprised to see that the sentence that should list the gaps that the authors (presumably) want to tackle contains a listing with "etc." How do we know whether the suggested benchmarking framework fulfills the goals of the authors if the list is not complete? Which other gaps might be hidden in "etc."?
-- Table 1 contains the terms "many" and "several" without it being clear 1) why the authors might not be able to provide an exact number of LLMs or list of tasks and 2) what these terms mean (more than 10, 20, 40?).
-- "This library is compatible to many open LLMs and enables serving and inferencing them." --> I guess it should be something like "inferencing with them". Otherwise, it means that one would try to infer a complete LLM, which doesn't seem to make sense.
-- In Section 3.5, some datasets are introduced. However, I am missing hard facts with respect to the size of the datasets that go beyond general descriptions like "small". A table summarizing the main features of the datasets could be helpful.
-- In Section 4.1.3, the authors present criteria which LLMs are included in the evaluation. The model "solar-pro-preview-instruct" is excluded "since it only supports a context length of up to 4096k Tokens". I assume that the "k" is misplaced. While the selection of LLMs is explained in detail, the statement "Additionally, we included a Qwen3 and Llama4 model as well as two DeepSeek models." comes without any further explanation. It would be good if the authors could explain their rationale, why these models are included. Otherwise, the detailed explanation of the model selection does not make much sense to me.
-- Figure 5 shows box plots and seems to combine them with a representation of the arithmetic mean. 1) The authors should explicitly mention that it is the arithmetic mean, since a box plot also contains the median. 2) The circle that the authors use for the arithmetic mean always has a white background, regardless of whether the circle is on a box, which has a colored background. It would be nice if this could be fixed (although it is not the most important issue).
-- It is stated that the "values behind these scores" (in Table 6) "have a high variance". Would it be possible to put the variance or standard deviation values into the Table? That would make it much easier to follow and more transparent.
- I would suggest that the font used in tables and figures should not be smaller than the font used in the text. All tables and nearly all figures seem to have smaller fonts, which make them hard to read. I understand the necessity to ensure that Tables 5 and 7 fit on a single page. However, I am not aware of a page limit for this journal article so I am wondering why the other tables and figures shouldn't receive the space that their content deserves.
- The related work section lists a long list of related publications. However, a comparison with the presented work is missing and should be added to emphasize the importance of LLM-KG-Bench.
- The paper briefly mentions "Task Classes and Parameterized Tasks". However, from the two sentences at that point, I only understood that task classes are used to organize tasks. I neither got the benefit of parameterized tasks (I guess any task may have some parameters) nor parameterized task classes. Later on, the authors dive down to implementation details and use the term "task class" there again, which makes me wonder whether I really understood task classes and I am sure that I didn't get parameterized task classes. So maybe the authors could improve the description of this part.

Overall, the writing itself is good but it would be beneficial if the paper could be proof read again. Some of the typos I found are listed at the end of the review.

## Open Science Data

The repeatability of the experiments seems to be good. However, there might be space for improvement w.r.t. the documentation.

- The framework itself is hosted as an open-source project on GitHub and has a DOI on Zenodo. The installation and usage of the framework are described in the readme file. However, I didn't try to execute it. Further documentation is spread across several linked files. However, I am missing a list of existing tasks (at least I couldn't find one). These task names seem to be necessary to configure the framework and, hence, a complete overview would be very helpful to attract more users.
- The experiment results are shared on GitHub in a separate project. The project structure is documented in the readme file.

## Conclusion

In my humble opinion, this is a good submission but it needs a minor revision, because of the problems pointed out before.

## List of smaller errors and suggestions
This part follows the pattern: Original text --> suggestion

### Abbreviations
While common abbreviations do not have to be introduced, several abbreviations used in this paper do not seem to be very common.
- p 4: "For open models, the Open-LLM-Leaderboard 3 provides a list of benchmark results with over 2,000 tested models, and with scores for IfEval, BBH, MATH, GPQU, MUSR, MMLU, plus a carbon dioxide emission estimate." --> The authors should either introduce the abbreviations or they should describe them. Just listing score names without any additional information does not seem to be very helpful.
- p 5: "using HumanEval, MultiPL-E, MBPP" --> the meaning of the first can be guessed but the other two remain unclear.
- p 5: "or RML generation (e.g.,14)." --> I had to look it up, since I was not sure what it could mean. It might be nice to introduce this abbreviation.
- p 6: "large language models (LLMs)" --> "LLMs"; it is not necessary to introduce an abbreviation twice
- p 6: I would suggest not introducing the abbreviations PLMs and NPLMs since both do not seem to be used again in the text.
- p19: The introduction of the abbreviations "SparqlSyntaxFixing(SSF), Sparql2Answer(S2A) and Text2Sparql(T2S)." comes too late since they have been already used on the previous pages. I would suggest introducing them directly when the task itself is defined.

### Typos, etc.
Text to remove is marked on the left, insertions are marked on the right with "[]", respectively.
- p 1: "LLMs.Finally" --> "LLMs.[ ]Finally"
- p 2: "Large Language Models (LLMs)[,] makes it difficult" --> "Large Language Models (LLMs) makes it difficult"
- p 3: "more than 40 current LLMs and an evaluation of the SPARQL SELECT related capabilities." --> "more than 40 current LLMs and an evaluation of the[ir] SPARQL-SELECT-related capabilities."
- p 3: "main concepts, API," --> either "APIs" or "its API"
- p 3: "model connectors and tasks" --> "model connectors and tasks[.]"
- p 3: "[providing] a unified source" --> "a unified source"
- p 4: "based on prompts given users" --> "based on prompts given [by] users"
- p 5: "Several leaderboards, like Big Code Models Leaderboard for Java, Javascript,and CPP, or EvalPlus Leaderboard for Python[,] assess the coding proficiency of LLMs using HumanEval, MultiPL-E, MBPP[, respectively]." --> "Several leaderboards, like [the] Big Code Models Leaderboard for Java, Javascript, and CPP, or [the] EvalPlus Leaderboard for Python assess the coding proficiency of LLMs using HumanEval, MultiPL-E, [and] MBPP." ("respectively" doesn't seem to fit here, since the first list contains 2 leaderboards while the second contains 3 elements)
- p 6: "accessing KG," --> "accessing KG[s]," or "accessing [a] KG,"
- p 7: "Dubey et al. extend the LC-QuAD dataset [with] LC-QuAD 2.0" --> Dubey et al. extend the LC-QuAD dataset [forming] LC-QuAD 2.0" (or a similar formulation)
- p 8: "infrastructure of LLM-KG-Bench framework" --> "infrastructure of [the] LLM-KG-Bench framework"
- p 8: "The main architecture is described in fig. 1, adapted version from Meyer et al.32." --> These two sentences contain several issues. I would suggest merging them into something like "Fig. 1 shows the architecture of the LLM-KG-Bench framework." The reference to [32] is not necessary, since the caption of the figure already refers to it.
- p10: "data can be fe[e]d into" --> "data can be fed into"
- p11: "Benchmark tasks in LLM-KG-Bench [do] implement" --> "Benchmark tasks in LLM-KG-Bench implement"
- p12: "prompt-answer-evaluate loop(fig. 2a)" --> "prompt-answer-evaluate loop[ ](fig. 2a)"
- p12: "the LLMs answer" --> "the LLM[']s answer"
- p15: "published [as well] another" --> "published another"
- p15: "fit into context size" --> "fit into [the] context size"
- P16: Table 3: column captions typically start with capital letters
- p17: "With the reevaluation feature the evaluation on these dialogs can get computed again with maybe updated code without new interaction with the LLMs." --> "The reevaluation feature allows to recalculate evaluation results without a new interaction with the LLMs." However, exactly the same statement already exists in Section 3. Hence, I would suggest to remove it from Section 4 since it is just repeating information that was already given and it doesn't seem to be of further importance to Section 4.
- p17: "data generated is6." --> "data generated is [presented by Heim et al.]6."
- p19: "For each task a box" --> "For each task[,] a box"
- p19: "Due to limited space, we show [here] a selection" --> "Due to limited space, we show a selection"
- p19: "Figures 5a and 5c show several LLMs seem to have no" --> "Figures 5a and 5c show several LLMs [that] seem to have no"
- p20: "Parameter" column --> "Parameters", "Parameter count", "Number of parameters", or "#Parameters"
- p20: "128k-1M" --> "128k--1M" (please use an n-dash instead of a hyphen for value ranges)
- p20: "235B(active 22B)" --> "235B[ ](active 22B)"
- p20: "As [can be seen in fig. 5b] several LLMs" --> "As [fig. 5b shows,] several LLMs"
- p21: Whitespaces are missing in Figure 5, e.g., "SparqlSyntaxFixing(SSF)" --> "SparqlSyntaxFixing[ ](SSF)"
- p21: "In fig. 5c a box plot" --> "In fig. 5c[,] a box plot"
- p21: "To get a quick overview on LLM models the" --> "To get a quick overview on LLM models[,] the"
- p22: "For the analysis of SPARQL capabilities we [decided to] use the following categories:" --> "For the analysis of SPARQL capabilities we use the following categories:"
- p22: All first sentences of the categories are not correct sentences, e.g., "Working with SPARQL SELECT queries with first answer."
- p22: "fig. 6 shows a" --> "Fig. 6 shows a"
- p27: "which does not necessary mean" --> "which does not necessar[il]y mean"
- p27: "And as the number of LLMs published and the size of datasets generated with the framework grow[s]" --> "And as the number of LLMs published and the size of datasets generated with the framework grow" (a number and a size = plural)
- p27: "highlight this findings" --> "highlight these findings"
- p27: "This way the evaluation could" --> "This way[,] the evaluation could"

### Suggestions
- I would suggest to use hyphens in compound adjectives since it eases the reading, e.g.:
- p 1: "SPARQL related" --> "SPARQL-related"
- p 3: "Knowledge Graph related" --> "Knowledge-Graph-related"
- p 3: "SPARQL SELECT query related" --> "SPARQL-SELECT-query-related"
- I would suggest sticking to a single writing of terms. For example:
- "Prompt-Answer-Evaluation loop" vs. "prompt-answer-evaluation loop" (e.g., Figure 2(a) vs. Figure 2(b) but also within the text)
- "LLM-KG-Bench" vs. "LLM-KG-bench"
- When referring to BigBench (e.g., on page 11 and following), the publication should be cited accordingly.
- In several cases, the citations are part of the sentence, e.g. "See also 31,32." Since the format of the journal seems to display the citations as endnotes, I would suggest mentioning the authors in these cases to make the sentences more readable, e.g., "See also Frey et al.31 and Meyer et al.32."
- Table 7 seems to be missing the \bottomrule, Table 6 is missing bottom and top rules.

Review #3
By Bohui Zhang submitted on 10/Nov/2025
Suggestion:
Minor Revision
Review Comment:

The paper proposes **LLM-KG-Bench**, a framework designed to evaluate the capability of large language models (LLMs) on knowledge graph (KG)-related tasks. This work largely fills a gap in the field, addressing the current lack of a systematic and automated evaluation benchmark for applying LLMs to KG-centric problems. The authors conduct extensive experiments using a carefully curated set of LLMs across multiple tasks, and their findings on LLM capabilities and preferences are generally well supported by the experimental results.

Strengths:

- The benchmark design and experimental setup focus on core KG tasks, which enhances the impact and relevance of the work.
- The research questions are mostly well-defined and aligned with key issues raised by the community.
- The selection of LLMs is well-documented, offering transparency and reproducibility.
- The inclusion of detailed technical documentation makes the framework easier to understand and reuse.
- The visualizations are informative and aid comprehension of the results.

Weaknesses and Suggestions for Improvement:

- Writing quality (especially Section 3) should be improved. The current version reads more like technical documentation for a software package rather than a scientific paper. It lacks sufficient detail on the benchmark’s evaluation task formulations, construction process, and dataset descriptions (e.g., dataset sizes).
- Figure 1 requires clearer explanations or a more detailed legend, as it is currently difficult to interpret.
- Subsection 3.2 does not align coherently with Figure 3, making the technical flow difficult to follow.
- Subsection 3.3 could be simplified, as it overlaps with the LLM selection discussion in Section 4.
- Subsection 2.3 could be divided into two parts, as Text2SPARQL and KBQA are described separately and deserve clearer distinction.
- In the Related Work section, the claim that “none addresses Knowledge Graph Engineering (KGE) tasks” is too strong. It would be more accurate to acknowledge existing benchmarks, such as those derived from the LAMA dataset [1] and the LM-KBC challenge series [2–4], as well as SPARQL-related benchmarks [5]. The statement could be rephrased as: “There remains a lack of a holistic benchmark addressing knowledge engineering tasks”.
- The relationship between the Text2AnswerList task and the existing KBQA benchmarks is unclear. If these tasks are equivalent or closely related, the authors should explain why existing KBQA benchmarks were not reused.
- In Table 1, the term “several” should be replaced with the actual list of topics for better clarity and comparison.
- Table 2 is included in the paper but not referenced or discussed in the text. It should be explicitly mentioned and explained.
- As mentioned before, details of the KGs used in Subsection 3.5 are insufficient. Information such as dataset sizes and composition should be added.
- The reevaluation feature is briefly mentioned but lacks explanation regarding its utility and stability, especially in the context of rapidly changing proprietary LLMs. Further justification would strengthen this point.
- It is unclear why experiments were conducted on only 6 out of 9 tasks. Clarifying the selection rationale would improve transparency.
- The scoring methods used to report experimental results are not sufficiently explained. The paper should describe what the scores represent and how they are calculated.
- While the experiments are comprehensive, the results analysis is overly concise. Simplifying earlier technical sections could make room for deeper discussion and interpretation of the findings.

Overall, this is a valuable and promising paper that addresses an important gap in LLM evaluation for knowledge graph tasks. With improvements to writing clarity, methodological explanation, and result interpretation, the paper could make a strong contribution to the community.

I therefore recommend **minor revision**.

*(Please feel free to correct me if I have misunderstood any parts of the paper.)*

[1] https://arxiv.org/abs/1909.01066

[2] https://ceur-ws.org/Vol-3577/paper0.pdf

[3] https://ceur-ws.org/Vol-3853/paper0.pdf

[4] https://ceur-ws.org/Vol-4041/paper7.pdf

[5] https://arxiv.org/abs/2407.11417