Knowledge Engineering with Large Language Models: A Capability Assessment in Ontology Evaluation

Tracking #: 3852-5066

This paper is currently under review
Authors: 
Stefani Tsaneva
Guntur Budi Herwanto
Majlinda Llugiqi
Marta Sabou

Responsible editor: 
Guest Editors 2025 LLM GenAI KGs

Submission type: 
Full Paper
Abstract: 
Advancements in large language models (LLMs) offer opportunities for automating challenging and time-intensive Knowledge Engineering (KE) tasks. Constructing an ontology is a complex process, particularly when logical restrictions are modeled or when the development is performed by novice knowledge engineers or domain experts with limited training in KE. Consequently, developed ontologies often contain modeling errors, undermining the success of ontology-based applications and hindering subsequent KE tasks. It is therefore important to investigate how LLMs can support KE tasks such as ontology evaluation, which involves the detection and correction of errors in knowledge-based resources. However, challenges remain in systematically evaluating LLM performance and comparing different models in terms of their capabilities to perform concrete KE tasks. Moreover, there is a lack of comprehensive, task-specific benchmarks needed for such LLM capability assessments. As a result, selecting the right LLM to effectively support knowledge engineers is a nontrivial problem. To fill these gaps, this study investigates how and to what extent LLMs can support four concrete ontology evaluation sub-tasks: the detection, classification, explanation, and possible correction of modeling issues in ontologies, focusing on the use of existential, universal, and cardinality constraints. To this end, we construct a benchmark dataset based on student-built ontologies and experimentally assess the performance of four LLMs (GPT-4o, Claude Sonnet, DeepSeek V3, and Llama 3.3) on these four KE sub-tasks. Additionally, we demonstrate how an annotation framework for the qualitative evaluation of LLM outputs can be defined and perform a comparative analysis of each model's capabilities. Our findings reveal notable differences in model behavior and task-specific strengths, underscoring the importance of selecting the most appropriate model to support concrete KE tasks.
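As a minimal illustration of the three OWL constraint types named in the abstract, the sketch below uses the owlready2 Python library on an invented toy ontology; all class and property names (Pizza, Topping, has_topping) are hypothetical examples and are not taken from the benchmark dataset described in the paper.

```python
# Minimal sketch (invented names): existential, universal, and cardinality
# constraints expressed with the owlready2 Python library.
from owlready2 import Thing, ObjectProperty, get_ontology

onto = get_ontology("http://example.org/demo.owl")

with onto:
    class Pizza(Thing): pass
    class Topping(Thing): pass
    class CheeseTopping(Topping): pass

    class has_topping(ObjectProperty):
        domain = [Pizza]
        range = [Topping]

    # Existential restriction: every Pizza has at least one Topping.
    Pizza.is_a.append(has_topping.some(Topping))

    class MargheritaPizza(Pizza): pass

    # Universal restriction: a MargheritaPizza may only have CheeseToppings.
    # Note: "only" alone does not assert that any topping exists, a well-known
    # source of modeling errors when it is confused with "some".
    MargheritaPizza.is_a.append(has_topping.only(CheeseTopping))

    # Cardinality restriction: a MargheritaPizza has exactly one CheeseTopping.
    MargheritaPizza.is_a.append(has_topping.exactly(1, CheeseTopping))
```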
Tags: 
Under Review