Review Comment:
# Publication Summary
The paper presents the current state of the benchmarking framework LLM-KG-Bench, which has been created for automatically assessing and comparing Large Language Models (LLMs) with respect to their capabilities to work with Semantic Web technologies. Within this work, the authors focus on evaluations related to the LLMs' capabilities to process and generate SPARQL SELECT queries and RDF data. The framework is described in detail in Section 3. The authors show the usefulness of the evaluation framework in Section 4 by comparing the performance of 41 LLMs on several tasks and draw conclusions from these results, e.g., which RDF format certain LLMs prefer.
This paper is an extended version of Lars-Peter Meyer, Johannes Frey, Desiree Heim, Felix Brei, Claus Stadler, Kurt Junghanns, and Michael Martin: "LLM-KG-Bench 3.0: A Compass for Semantic Technology Capabilities in the Ocean of LLMs" published at ESWC 2025. In comparison to the previous publications, the authors increased the number of evaluated LLMs and enhanced the analysis of the evaluation results.
# Review Summary
## Originality
There are several works that evaluate the performance of LLMs on tasks related to knowledge graphs. The submitted work itself lists several related articles that evaluate LLMs in similar ways. However, it seems that the presented work provides a large set of different SPARQL- and RDF-related tasks, includes connectors to a large number of LLMs, and offers automatic evaluations. The latter point is especially important as manual or crowd-based evaluations are costly.
## Significance of the Results
The authors present some significant insights. They are not only able to compare the performance of the evaluated LLMs for a single task but show that based on their framework further insights can be gathered. It is also pointed out that all evaluation data is collected and made available for further analysis. In addition, intermediate results (i.e., the answers of the LLMs) are stored by the framework and can be evaluated again in case further analyses are implemented.
## Quality of Writing
The weakest part of the paper seems to be its presentation. A paper should be self-contained (I know that this concept has limitations). However, this paper builds upon previous works in a way that makes it nearly impossible to understand without having read the previously published papers about the framework, the tasks, and especially the scores that are used in the evaluation. I think this needs to be improved. Especially the points raised in the following need the authors' attention:
- The introduction clearly states that it is an extension of [4]. However, the paper itself refers to [4] several times, which looks very strange to me as I would expect that the extension includes everything that [4] contains.
-- The sentence "For SSF and T2S we selected the max_combined score, which is 0.2 for syntactically correct but wrong queries." is a good example that the paper is missing critical information. The max_combined score is not explained, but used in Figure 5. The same holds for the combinedF1 score, which is also not explained. Since these scores do not seem to be very common (unless one has read all the related work), it is necessary to briefly explain them.
- Section 3.1 seems to have an unfortunate structure at the moment. It starts by presenting Figure 1, but instead of explaining it, the text directly continues to explain concepts, which are not shown in Figure 1 but in Figure 2. It would be very important to give the reader more guidance. I can imagine two solutions: 1) Start with an overview and explain it; then dive deeper into the framework and explain specific parts like the Prompt-Answer-Evaluate loop. OR 2) Start with an important core concept, like the Prompt-Answer-Evaluate loop, and then explain the parts of the evaluation framework around it that enable the loop to do its work.
-- "Task Case Entries" (p 8) are mentioned but not explained in this paragraph. Instead, they are explained on page 10.
-- The names of the concepts in 3.1 and Figures 1 and 2 are not well aligned. "Task Evaluation Iterations" vs. "task evaluation execution"; "Task Executions" vs. "task iteration"; "List Tasks" vs. "taskList" vs. "Task Collection"(?)
-- "Result Reevaluation" does not seem to exist in Figure 1. Or is it optional?
- I would suggest reformulating parts of the text to be more precise, as some formulations are either not helpful or not correct.
-- "Moreover, what is generally amiss is a benchmark execution framework that helps to deal with the particularities of RDF and KG-related workloads (format parsing, syntax check feed back loops, execution and evaluation of queries towards KGs, etc.)." --> I am surprised to see that the sentence that should list the gaps that the authors (presumably) want to tackle contains a listing with "etc." How do we know whether the suggested benchmarking framework fulfills the goals of the authors if the list is not complete? Which other gaps might be hidden in "etc."?
-- Table 1 contains the terms "many" and "several" without it being clear 1) why the authors might not be able to provide an exact number of LLMs or list of tasks and 2) what these terms mean (more than 10, 20, 40?).
-- "This library is compatible to many open LLMs and enables serving and inferencing them." --> I guess it should be something like "inferencing with them". Otherwise, it means that one would try to infer a complete LLM, which doesn't seem to make sense.
-- In Section 3.5, some datasets are introduced. However, I am missing hard facts with respect to the size of the datasets that go beyond general descriptions like "small". A table summarizing the main features of the datasets could be helpful.
-- In Section 4.1.3, the authors present the criteria determining which LLMs are included in the evaluation. The model "solar-pro-preview-instruct" is excluded "since it only supports a context length of up to 4096k Tokens". I assume that the "k" is misplaced. While the selection of LLMs is explained in detail, the statement "Additionally, we included a Qwen3 and Llama4 model as well as two DeepSeek models." comes without any further explanation. It would be good if the authors could explain their rationale why these models are included. Otherwise, the detailed explanation of the model selection does not make much sense to me.
-- Figure 5 shows box plots and seems to combine them with a representation of the arithmetic mean. 1) The authors should explicitly mention that it is the arithmetic mean, since a box plot also contains the median. 2) The circle that the authors use for the arithmetic mean always has a white background, regardless of whether the circle is on a box, which has a colored background. It would be nice if this could be fixed (although it is not the most important issue).
-- It is stated that the "values behind these scores" (in Table 6) "have a high variance". Would it be possible to put the variance or standard deviation values into the table? That would make it much easier to follow and more transparent.
- I would suggest that the font used in tables and figures should not be smaller than the font used in the text. All tables and nearly all figures seem to use smaller fonts, which makes them hard to read. I understand the necessity to ensure that Tables 5 and 7 fit on a single page. However, I am not aware of a page limit for articles in this journal, so I am wondering why the other tables and figures should not receive the space that their content deserves.
- The related work section contains a long list of related publications. However, a comparison with the presented work is missing and should be added to emphasize the importance of LLM-KG-Bench.
- The paper briefly mentions "Task Classes and Parameterized Tasks". However, from the two sentences at that point, I only understood that task classes are used to organize tasks. I neither understood the benefit of parameterized tasks (I guess any task may have some parameters) nor that of parameterized task classes. Later on, the authors dive down to implementation details and use the term "task class" there again, which makes me wonder whether I really understood task classes; I am sure that I did not get parameterized task classes. So maybe the authors could improve the description of this part.
Overall, the writing itself is good, but it would be beneficial if the paper were proofread again. Some of the typos I found are listed at the end of the review.
## Open Science Data
The repeatability of the experiments seems to be good. However, there might be room for improvement w.r.t. the documentation.
- The framework itself is hosted as an open-source project on GitHub and has a DOI on Zenodo. The installation and usage of the framework are described in the readme file. However, I didn't try to execute it. Further documentation is spread across several linked files. However, I am missing a list of existing tasks (at least I couldn't find one). These task names seem to be necessary to configure the framework and, hence, a complete overview would be very helpful to attract more users.
- The experiment results are shared on GitHub in a separate project. The project structure is documented in the readme file.
## Conclusion
In my humble opinion, this is a good submission, but it needs a minor revision because of the problems pointed out above.
## List of smaller errors and suggestions
This part follows the pattern: Original text --> suggestion
### Abbreviations
While common abbreviations do not have to be introduced, several abbreviations used in this paper do not seem to be very common.
- p 4: "For open models, the Open-LLM-Leaderboard 3 provides a list of benchmark results with over 2,000 tested models, and with scores for IfEval, BBH, MATH, GPQU, MUSR, MMLU, plus a carbon dioxide emission estimate." --> The authors should either introduce the abbreviations or they should describe them. Just listing score names without any additional information does not seem to be very helpful.
- p 5: "using HumanEval, MultiPL-E, MBPP" --> the meaning of the first can be guessed but the other two remain unclear.
- p 5: "or RML generation (e.g.,14)." --> I had to look it up, since I was not sure what it could mean. It might be nice to introduce this abbreviation.
- p 6: "large language models (LLMs)" --> "LLMs"; it is not necessary to introduce an abbreviation twice
- p 6: I would suggest not introducing the abbreviations PLMs and NPLMs since both do not seem to be used again in the text.
- p19: The introduction of the abbreviations "SparqlSyntaxFixing(SSF), Sparql2Answer(S2A) and Text2Sparql(T2S)." comes too late since they have been already used on the previous pages. I would suggest introducing them directly when the task itself is defined.
### Typos, etc.
Text to be removed is marked with "[]" on the left side of the arrow; insertions are marked with "[]" on the right side.
- p 1: "LLMs.Finally" --> "LLMs.[ ]Finally"
- p 2: "Large Language Models (LLMs)[,] makes it difficult" --> "Large Language Models (LLMs) makes it difficult"
- p 3: "more than 40 current LLMs and an evaluation of the SPARQL SELECT related capabilities." --> "more than 40 current LLMs and an evaluation of the[ir] SPARQL-SELECT-related capabilities."
- p 3: "main concepts, API," --> either "APIs" or "its API"
- p 3: "model connectors and tasks" --> "model connectors and tasks[.]"
- p 3: "[providing] a unified source" --> "a unified source"
- p 4: "based on prompts given users" --> "based on prompts given [by] users"
- p 5: "Several leaderboards, like Big Code Models Leaderboard for Java, Javascript,and CPP, or EvalPlus Leaderboard for Python[,] assess the coding proficiency of LLMs using HumanEval, MultiPL-E, MBPP[, respectively]." --> "Several leaderboards, like [the] Big Code Models Leaderboard for Java, Javascript, and CPP, or [the] EvalPlus Leaderboard for Python assess the coding proficiency of LLMs using HumanEval, MultiPL-E, [and] MBPP." ("respectively" doesn't seem to fit here, since the first list contains 2 leaderboards while the second contains 3 elements)
- p 6: "accessing KG," --> "accessing KG[s]," or "accessing [a] KG,"
- p 7: "Dubey et al. extend the LC-QuAD dataset [with] LC-QuAD 2.0" --> Dubey et al. extend the LC-QuAD dataset [forming] LC-QuAD 2.0" (or a similar formulation)
- p 8: "infrastructure of LLM-KG-Bench framework" --> "infrastructure of [the] LLM-KG-Bench framework"
- p 8: "The main architecture is described in fig. 1, adapted version from Meyer et al.32." --> This sentence contains several issues. I would suggest rephrasing it into something like "Fig. 1 shows the architecture of the LLM-KG-Bench framework." The reference to [32] is not necessary, since the caption of the figure already refers to it.
- p10: "data can be fe[e]d into" --> "data can be fed into"
- p11: "Benchmark tasks in LLM-KG-Bench [do] implement" --> "Benchmark tasks in LLM-KG-Bench implement"
- p12: "prompt-answer-evaluate loop(fig. 2a)" --> "prompt-answer-evaluate loop[ ](fig. 2a)"
- p12: "the LLMs answer" --> "the LLM[']s answer"
- p15: "published [as well] another" --> "published another"
- p15: "fit into context size" --> "fit into [the] context size"
- p16: Table 3: column captions typically start with capital letters
- p17: "With the reevaluation feature the evaluation on these dialogs can get computed again with maybe updated code without new interaction with the LLMs." --> "The reevaluation feature allows recalculating evaluation results without a new interaction with the LLMs." However, exactly the same statement already exists in Section 3. Hence, I would suggest removing it from Section 4, since it just repeats information that was already given and does not seem to be of further importance to Section 4.
- p17: "data generated is6." --> "data generated is [presented by Heim et al.]6."
- p19: "For each task a box" --> "For each task[,] a box"
- p19: "Due to limited space, we show [here] a selection" --> "Due to limited space, we show a selection"
- p19: "Figures 5a and 5c show several LLMs seem to have no" --> "Figures 5a and 5c show several LLMs [that] seem to have no"
- p20: "Parameter" column --> "Parameters", "Parameter count", "Number of parameters", or "#Parameters"
- p20: "128k-1M" --> "128k--1M" (please use an n-dash instead of a hyphen for value ranges)
- p20: "235B(active 22B)" --> "235B[ ](active 22B)"
- p20: "As [can be seen in fig. 5b] several LLMs" --> "As [fig. 5b shows,] several LLMs"
- p21: Whitespaces are missing in Figure 5, e.g., "SparqlSyntaxFixing(SSF)" --> "SparqlSyntaxFixing[ ](SSF)"
- p21: "In fig. 5c a box plot" --> "In fig. 5c[,] a box plot"
- p21: "To get a quick overview on LLM models the" --> "To get a quick overview on LLM models[,] the"
- p22: "For the analysis of SPARQL capabilities we [decided to] use the following categories:" --> "For the analysis of SPARQL capabilities we use the following categories:"
- p22: All first sentences of the categories are not correct sentences, e.g., "Working with SPARQL SELECT queries with first answer."
- p22: "fig. 6 shows a" --> "Fig. 6 shows a"
- p27: "which does not necessary mean" --> "which does not necessar[il]y mean"
- p27: "And as the number of LLMs published and the size of datasets generated with the framework grow[s]" --> "And as the number of LLMs published and the size of datasets generated with the framework grow" (a number and a size = plural)
- p27: "highlight this findings" --> "highlight these findings"
- p27: "This way the evaluation could" --> "This way[,] the evaluation could"
### Suggestions
- I would suggest using hyphens in compound adjectives since it eases the reading, e.g.:
- p 1: "SPARQL related" --> "SPARQL-related"
- p 3: "Knowledge Graph related" --> "Knowledge-Graph-related"
- p 3: "SPARQL SELECT query related" --> "SPARQL-SELECT-query-related"
- I would suggest sticking to a single spelling of terms. For example:
- "Prompt-Answer-Evaluation loop" vs. "prompt-answer-evaluation loop" (e.g., Figure 2(a) vs. Figure 2(b) but also within the text)
- "LLM-KG-Bench" vs. "LLM-KG-bench"
- When referring to BigBench (e.g., on page 11 and following), the publication should be cited accordingly.
- In several cases, the citations are part of the sentence, e.g., "See also 31,32." Since the format of the journal seems to display the citations as endnotes, I would suggest mentioning the authors in these cases to make the sentences more readable, e.g., "See also Frey et al.31 and Meyer et al.32."
- Table 7 seems to be missing the \bottomrule, Table 6 is missing bottom and top rules.