Knowledge Engineering with Large Language Models: A Capability Assessment in Evaluating Ontology Property Restrictions

Tracking #: 3980-5194

Authors: 
Stefani Tsaneva
Guntur Budi Herwanto
Majlinda Llugiqi
Marta Sabou

Responsible editor: 
Guest Editors 2025 LLM GenAI KGs

Submission type: 
Full Paper
Abstract: 
Advancements in large language models (LLMs) offer opportunities for automating challenging and time-intensive Knowledge Engineering (KE) tasks. Constructing an ontology is a complex process, particularly when logical restrictions are modeled or when the development is performed by novice knowledge engineers or domain experts with limited training in KE. Consequently, developed ontologies often contain modeling errors, undermining the success of ontology-based applications and hindering subsequent KE tasks. Thus, it is important to investigate how LLMs can support KE tasks such as the evaluation of ontologies, which includes, among other tasks (e.g., inconsistency detection and competency question alignment), the detection and correction of errors in knowledge-based resources. However, challenges remain in systematically evaluating LLM performance and comparing different models in terms of their capabilities to perform concrete KE tasks. Moreover, there is a lack of comprehensive, task-specific benchmarks needed for such LLM capability assessments. As a result, selecting the right LLM to effectively support knowledge engineers presents a nontrivial problem. To fill these gaps, this study investigates how and to what extent LLMs can support four concrete (but not exhaustive) ontology evaluation sub-tasks: the detection, classification, explanation, and possible correction of modeling issues in OWL ontologies, focusing on the use of existential, universal, and cardinality property restrictions. To this end, we construct a benchmark dataset based on student-built ontologies and perform experimental assessments of the performance of four LLMs (GPT-4o, Claude Sonnet, DeepSeek V3, and Llama 3.3) on these four KE sub-tasks. Additionally, we exemplify the definition of an annotation framework for the qualitative evaluation of LLM outputs and perform a comparative analysis of each model's capabilities.
Our findings reveal notable differences in model behavior and task-specific strengths, underscoring the importance of selecting the most appropriate model for a concrete KE task.
Tags: 
Reviewed

Decision/Status: 
Minor Revision

Solicited Reviews:
Review #1
Anonymous submitted on 03/Jan/2026
Suggestion:
Accept
Review Comment:

After carefully considering the authors’ responses to my previous comments and re-reading the revised manuscript, I am convinced that the paper has been substantially improved. The authors have addressed my concerns thoroughly and thoughtfully within the paper and cover letter.

In particular, the authors provide clear justification for the scope of the evaluated LLMs and clearly position the study’s goal as the establishment of a reusable benchmark and evaluation methodology rather than an exhaustive LLM comparison. The clarifications regarding the expert-based evaluation procedures, assessment criteria, and reproducibility considerations significantly strengthen the methodological rigor of the work.

Overall, the proposed capability assessment framework advances the evaluation of ontology-related tasks and addresses an important gap in current research. This direction is especially timely and relevant given the growing body of work in knowledge engineering that leverages large language models.

I appreciate the authors' careful revisions and detailed responses, and I have no remaining concerns about accepting the paper for publication in SWJ.

Review #2
By Bohui Zhang submitted on 23/Mar/2026
Suggestion:
Accept
Review Comment:

This paper makes a clear and useful contribution by defining four concrete ontology evaluation sub-tasks for LLMs, constructing an expert-annotated benchmark from student ontologies, and conducting a comparative assessment across four representative models. The study is well motivated, methodologically sound, and addresses an important gap in evaluating LLMs for knowledge engineering. The experimental analysis is informative and practically relevant, showing distinct model-specific strengths across detection, classification, explanation, and correction, while also yielding actionable guidance for ontology evaluation workflows. I recommend acceptance.

Review #3
Anonymous submitted on 29/Mar/2026
Suggestion:
Minor Revision
Review Comment:

----Short Summary of the Work----
This work evaluates the capabilities of LLMs on 4 well-defined ontology engineering tasks. The authors identify 4 problems at the intersection of LLMs and knowledge engineering: 1) underexplored ontology evaluation tasks; 2) limited comparative studies of LLMs; 3) a lack of benchmarks; 4) challenges in evaluating LLM capabilities. To address these, the work assesses the use of LLMs for the evaluation of ontology property restrictions, focusing on existential, universal, and cardinality property restrictions. Four sub-tasks are defined: detection, classification, explanation, and correction of modelling issues in OWL ontologies. The benchmark dataset used in the evaluations is curated and annotated by experts based on student-built ontologies. The following LLMs are evaluated: GPT-4o, Claude Sonnet, DeepSeek V3, and Llama 3.3.

----Originality----

The methodology employed for the benchmark construction and the evaluation of LLM results makes a good contribution to KE and LLM research. The tasks and the assessment criteria are well defined. However, for task 2, I think the paper would be easier to read if the authors explicitly mentioned in the subheading that it is a multi-label classification task.
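
To illustrate why the multi-label framing matters for how task 2 is scored, the following is a minimal Python sketch of set-based scoring, in which each axiom may carry several mistake types at once. The function name, the chosen metrics, and the gold and predicted label sets are hypothetical; the paper may score task 2 differently. Only the two label strings are taken from the wording quoted later in this review.

```python
def multilabel_scores(gold, predicted):
    """Micro-averaged precision, recall, F1 and exact-match ratio over label sets."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must cover the same items")
    tp = fp = fn = exact = 0
    for gold_labels, predicted_labels in zip(gold, predicted):
        g, p = set(gold_labels), set(predicted_labels)
        tp += len(g & p)   # labels the model found correctly
        fp += len(p - g)   # labels the model added wrongly
        fn += len(g - p)   # labels the model missed
        exact += g == p    # whole label set matches
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    exact_match = exact / len(gold) if gold else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "exact_match": exact_match}


# Hypothetical annotations for two axioms (illustration only).
MISUNDERSTANDING = "Logical misunderstanding of the used property restriction"
WRONG_CLASS = "Property restriction placement at the wrong class"

gold = [{MISUNDERSTANDING, WRONG_CLASS}, {WRONG_CLASS}]
predicted = [{MISUNDERSTANDING}, {WRONG_CLASS}]

scores = multilabel_scores(gold, predicted)
print(scores)
```

In this toy case the model is fully precise but misses one of three gold labels, so a single-label accuracy figure would hide the partial credit that a multi-label metric exposes.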

In Section 4.2, it is mentioned that 14 axioms are excluded from the gold standard. I think it would be helpful if the authors could further explain the motivation for excluding 14 out of 96 axioms (more than 14%) and discuss the impact on the dataset size and results. Also, given that these excluded axioms are not included in the benchmark, it would be helpful if they could be released or disclosed in some format.

----Significance of the Results----

1. My first concern is related to reproducibility. According to the authors’ response to the last round of reviews, the aim is “rather to establish a benchmark and define a set of ontology-evaluation tasks”.
Reproducibility and reusability should therefore be crucial contributions of this work, and the methodology should be reusable.

However, this work is not straightforward to reproduce, because the datasets are not publicly available and because expert/human labour is involved in the result evaluation, especially for Experiment 3 and Experiment 4.

For Experiment 3, as mentioned in the paper, 10% of the generated explanations were jointly evaluated and another 10% were assessed independently. This means that the remaining majority of the explanations, presumably 80%, were evaluated by a single expert. For Experiment 4, 90% were assessed by a single expert. This raises a subjectivity concern.

Moreover, Fleiss’ kappa is typically preferred when there are 3 or more raters. It is not clear to me why the authors chose Fleiss’ kappa for both Experiment 3 and Experiment 4. If all raters are given the same 10% portion of the generated explanations, Cohen's kappa is usually preferred for two raters. On another note, given that there are 29 cases and 4 LLMs in Experiment 3, I am not sure how many explanations the 10% corresponds to (i.e., 10% of 116?). It would be helpful if the authors could explicitly disclose this.
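
To make the two-rater case concrete, the following is a minimal Python sketch of Cohen's kappa, together with the arithmetic behind the 10% question. The function name and the example ratings are hypothetical and are not taken from the paper; only the counts of 29 cases and 4 LLMs come from the text above.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters who labeled the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: share of items with identical labels.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical ratings of four explanations (1 = acceptable, 0 = not acceptable).
example_kappa = cohen_kappa([1, 1, 0, 0], [1, 0, 0, 0])

# Size of the shared sample: 29 cases x 4 LLMs, of which 10% were rated jointly.
total_explanations = 29 * 4
ten_percent = total_explanations / 10

print(example_kappa, total_explanations, ten_percent)
```

Since 10% of 116 is not a whole number, the actual count of jointly rated explanations depends on a rounding choice that only the authors can state.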

I think this weakens the results.

2. Another major concern is related to Experiment 2 (Classifying an Ontology Modeling Issue). In Section 4.2, six main mistake types are identified, and a category Other is included for two different mistakes that each occurred only once. The LLM prompt for Experiment 2, however, includes 7 types of mistakes, while Experiment 2 evaluated the six mistake types defined in Section 4.2.

There are also clear misalignments and terminology differences between the mistake types in the prompt and those in Section 4.2 and the evaluation.

For example, the third mistake type in the prompt is “Logical misunderstanding of the used constraint”, whereas the same mistake is described in the paper as “Logical misunderstanding of the used property restriction”. The sixth mistake type in the prompt is “Constraint placement at the wrong class”, which seems to correspond to “Property restriction placement at the wrong class”.
The mistake type “Logical under-restriction, caused by the Open World Assumption” in the prompt does not seem to be mentioned in the paper.

There were comments in the last round of reviews regarding the term “Property restriction”. I think the terminology should be consistent throughout the paper. Moreover, since this wording is included in the prompts and could affect the experimental results, the wording used to report the results should be consistent with the wording the LLMs were actually prompted with.

On another note, the LLM prompt for Experiment 2 includes neither instructions for mistakes that do not correspond to any listed type nor an option for the Other category. The evaluation of the Other category is therefore unreliable, since choosing Other does not align with the instructions given.
Overall, I think Experiment 2 is not coherent across the prompts used, the experiments, and the reported results. This weakens the results and conclusion.

3. The problem statements (P1–P4) are discussed in the introduction, but they are not mentioned or discussed again until the conclusion. I think the links between these problems and the research questions, experimental design, and results are not clearly presented in the paper. This applies especially to P3, the absence of benchmarks: since this paper does not position itself as a benchmark resource paper and the dataset is not publicly available, I think this work is not well aligned with resolving P3.

4. The performance of Llama 3 is not discussed in detail in either the Discussion or the Conclusion. I think it would be useful to discuss the impact of model size on performance, as Llama 3 is significantly smaller than all other models in this paper. It would also be useful to mention:
1) the size of DeepSeek V3; 2) whether reasoning is enabled for Sonnet; 3) the exact version of GPT-4o used.

----Quality of Writing----
The overall writing is very good and explains the key ideas very well. There are a few minor issues that can be easily resolved:
Spelling of ‘modeling’ and ‘modelled’: the authors use both ‘modeling’ (American spelling) and ‘modelled’ (British spelling) throughout the paper. I think the spelling should follow one style consistently.
On page 2, line 40, research question 3 (“...how they perform ontology evaluation?”) mentions “ontology evaluation”, while the other research questions only mention “ontology restriction evaluation”.

----Data or Resource----
The authors did not make the benchmark dataset publicly available. The source code of a tool used in a structured extraction workflow to build the corpus is available at https://github.com/wu-semsys/ontology-analysis. This resource is easy to access, and a README file explains how to use it.
The provided resources are not sufficient to replicate the experiments, because they consist only of source code for processing the data. They do not include the student corpus source, the curated benchmark, or the code to run the 4 experiments.