Review Comment:
-----Short Summary of the Work----
This work evaluates the capability of LLMs on 4 well-defined ontology engineering tasks. The authors identify 4 problems in the LLM and knowledge engineering areas: 1) underexplored ontology evaluation tasks; 2) limited comparative studies of LLMs; 3) lack of benchmarks; 4) challenges in evaluating LLM capability. To address these, the authors assess the use of LLMs for the evaluation of ontology property restrictions, specifically existential, universal, and cardinality property restrictions. Four subtasks are defined: detection, classification, explanation, and correction of modelling issues in OWL ontologies. The benchmark dataset used in the evaluations is curated and annotated by experts based on student-built ontologies. The following LLMs are evaluated: GPT-4o, Claude Sonnet, DeepSeek V3, and Llama 3.3.
-------Originality------
The methodology employed for the benchmark construction and for the evaluation of the LLM results makes a good contribution to KE and LLM research. The tasks and the assessment criteria are well defined. However, for task 2, I think the paper would be easier to read if the authors explicitly stated in the subheading that it is a multi-label classification task.
In Section 4.2, it is mentioned that 14 axioms are excluded from the gold standard. I think it would be helpful if the authors further explained the motivation for excluding 14 out of 96 axioms (more than 14%) and discussed the impact on the dataset size and results. Also, given that these excluded axioms are not included in the benchmark, it would be helpful if they could be released or disclosed in some format.
----Significance of the Results----
1. My first concern is related to reproducibility. The authors’ response to the last round of reviews reads: “rather to establish a benchmark and define a set of ontology-evaluation tasks”.
Given this, reproducibility and reusability should clearly be crucial contributions of this work, and the methodology should be reusable.
However, this work is not straightforward to reproduce, because the datasets are not publicly available and the result evaluation relies on expert/human labour, especially for Experiment 3 and Experiment 4.
For Experiment 3, as mentioned in the paper, 10% of the generated explanations were jointly evaluated and another 10% were assessed independently. This means that the remaining majority of the explanations, presumably 80%, were evaluated by a single expert. For Experiment 4, 90% were assessed by a single expert. This raises a subjectivity concern.
Moreover, Fleiss’ kappa is typically preferred when there are 3 or more raters. It is not clear to me why the authors chose Fleiss’ kappa for both Experiment 3 and Experiment 4. If all raters are given the same 10% portion of the generated explanations and there are only two raters, Cohen’s kappa is usually preferred. On another note, given that there are 29 cases and 4 LLMs in Experiment 3, I am not sure how many explanations the 10% corresponds to (i.e., 10% of 116?). It would be helpful if the authors could explicitly disclose this.
I think this weakens the results.
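To make the kappa point concrete, the following is a minimal sketch of Cohen's kappa for two raters, together with the sample-size arithmetic for Experiment 3. The function name and the example labels are hypothetical and are not taken from the paper; the sketch only illustrates the statistic that would apply if exactly two raters assessed the same items.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items (illustrative sketch)."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: fraction of items on which both raters agree.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    count_a = Counter(rater_a)
    count_b = Counter(rater_b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1.0:
        # Both raters used a single identical label; kappa is undefined,
        # so this sketch reports perfect agreement by convention.
        return 1.0
    return (observed - expected) / (1 - expected)


# 29 cases x 4 LLMs, as described for Experiment 3.
total_explanations = 29 * 4
# 10% of the total is not a whole number, so the sampled count is ambiguous.
ten_percent = total_explanations / 10

# Hypothetical ratings, purely for illustration.
example_a = ["ok", "ok", "bad", "bad"]
example_b = ["ok", "bad", "bad", "bad"]
example_kappa = cohen_kappa(example_a, example_b)
```

Since 10% of 116 is not an integer, the independently assessed subset presumably contains 11 or 12 explanations, which is a small sample for an agreement statistic.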
2. Another major concern is related to Experiment 2 (Classifying an Ontology Modeling Issue). In Section 4.2, six main mistake types are identified, and the category Other is also included for two different mistakes that occurred only once. However, the LLM prompt for Experiment 2 includes 7 types of mistakes, whereas Experiment 2 evaluates the six mistake types defined in Section 4.2.
There are also clear misalignments and terminology differences between the types in the prompts and those in Section 4.2 and the evaluation.
For example, the third mistake type in the prompt is “Logical misunderstanding of the used constraint”. The same mistake is described in the paper as “Logical misunderstanding of the used property restriction”. The 6th mistake type in the prompt is “Constraint placement at the wrong class”, which seems to correspond to “Property restriction placement at the wrong class”.
The mistake “Logical under-restriction, caused by the Open World Assumption” in the prompt does not seem to be mentioned.
There were comments in the last round of review regarding “Property restriction”. I think the terminology should be consistent throughout the paper. Moreover, since this wording is included in the prompts and may affect the experimental results, the wording used to report the results should be consistent with what the LLMs were actually prompted with.
On another note, the LLM prompt for Experiment 2 includes neither instructions for mistakes that do not correspond to any listed type nor an option for the Other category. Therefore, the evaluation of the Other category is unreliable, as choosing Other does not align with the instructions given.
Therefore, I think the results for Experiment 2 are not coherent between the prompts used, the experiments, and the reported results. This weakens the results and conclusion.
3. The problem statements (P1–P4) are discussed in the introduction, but they are not mentioned or discussed again until the conclusion. I think the links between these problems and the research questions, experimental design, and results are not clearly presented in the paper. This applies especially to P3 (absence of benchmarks): as this paper does not position itself as a benchmark resource paper and the dataset is not publicly available, I think this work is not well aligned with resolving P3.
4. The performance of Llama 3 is not discussed in detail in either the Discussion or the Conclusion. I think it would be useful to discuss the impact of model size on performance, as Llama 3 is significantly smaller than all other models in this paper. It would also be useful to mention:
1) the size of DeepSeek V3; 2) whether reasoning is enabled for Sonnet; 3) the exact version of GPT-4o used.
----Quality of Writing----
The overall writing is very good and explains the key ideas very well. There are a few minor issues that can be easily resolved:
Spelling of ‘modeling’ and ‘modelled’: the authors use both ‘modeling’, which is the American spelling, and ‘modelled’, which is the British spelling, throughout the paper. I think the spelling should follow one style consistently.
On page 2, line 40, research question 3 (“...how they perform ontology evaluation?”) mentions ontology evaluation, while the other research questions only mention “ontology restriction evaluation”.
-----Data or Resource----
The authors did not make the benchmark dataset publicly available. The source code for a tool used during a structured extraction workflow to build the corpus is available at https://github.com/wu-semsys/ontology-analysis. This resource is easy to access and comes with a README file.
The provided resources are not sufficient to replicate the experiments, because they consist only of source code for processing the data. They do not include the student corpus source, the curated benchmark, or the code to run the 4 experiments.