A Framework for Assessing LLM Consistency in Knowledge Engineering

Tracking #: 3865-5079

Authors: 
Mohammad Javad Saeedizade
Reham Alharbi
Hamed Babaei Giglou
Anna Sofia Lippolis
Eva Blomqvist
Valentina Tamma
Floriana Grasso
Terry Payne
Jennifer D'Souza
Sören Auer
Andrea Giovanni Nuzzolese
Robin Keskisärkkä
Zebah Valeyil

Responsible editor: 
Cogan Shimizu

Submission type: 
Full Paper
Abstract: 
Consistency, i.e., the degree to which a system, process, or its results produce similar outcomes when repeated under identical or different conditions, is a critical concern in knowledge engineering (KE). This is particularly the case given the increasing reliance on Large Language Models (LLMs) in various tasks. This paper introduces CoLLM, a framework designed to assess whether a system or process produces consistent results in LLM-based KE tasks through three tests: (1) the LLM Repeatability Test, which evaluates the level of stochasticity or non-determinism of LLMs in existing studies; (2) the LLM Update Impact Test, which examines the effect that LLM updates may have on results; and (3) the LLM Replacement Test, which explores the effect of using alternative LLMs to perform the same study. Through 59 different experiments taken from five separate, recent studies, and leveraging various LLMs and datasets, we investigate the consistency of the results to empirically validate the reliability of the original findings of each study. Our investigation shows that in the majority of cases (81.4%), behaviour consistent with the original studies can be observed, despite some variability across individual outputs. Additionally, in some cases, changing the choice of LLM can result in a consistent improvement across different metrics. These results demonstrate the general viability of the proposed framework for assessing the consistency of LLM-based KE tasks.
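To make the three tests concrete, the following is a minimal Python sketch of how they could be operationalized. It is an illustration only, not code from the CoLLM repository: the model callables and the agree() metric are hypothetical placeholders for a study's actual LLM client and task-specific comparison measure.

from typing import Callable

# A model is abstracted as a callable mapping a prompt to an output string.
LLMFn = Callable[[str], str]

def agree(a: str, b: str) -> float:
    # Placeholder agreement metric: exact match after trimming whitespace.
    # A real study would substitute its own metric (e.g., F1 over extracted triples).
    return 1.0 if a.strip() == b.strip() else 0.0

def repeatability_test(model: LLMFn, prompt: str, n: int = 3) -> float:
    # T_rep: run the same prompt n times on the same model and average
    # pairwise agreement between the outputs.
    outputs = [model(prompt) for _ in range(n)]
    pairs = [(a, b) for i, a in enumerate(outputs) for b in outputs[i + 1:]]
    return sum(agree(a, b) for a, b in pairs) / len(pairs)

def update_impact_test(old_version: LLMFn, new_version: LLMFn, prompt: str) -> float:
    # T_upi: compare the outputs of two versions of the same model family.
    return agree(old_version(prompt), new_version(prompt))

def replacement_test(model_a: LLMFn, model_b: LLMFn, prompt: str) -> float:
    # T_rpl: compare the outputs of two different model families on the same task.
    return agree(model_a(prompt), model_b(prompt))

In practice, agree() would be replaced by the evaluation metric of the study being replicated, with the resulting scores judged against the framework's pass/fail criteria.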

Decision/Status: 
Major Revision

Solicited Reviews:
Review #1
Anonymous submitted on 15/Jul/2025
Suggestion:
Minor Revision
Review Comment:

[Summary] This paper introduces CoLLM, a framework designed to assess the consistency of LLMs in KE tasks. The authors examine consistency in three specific aspects: repeatability (the same model and input producing the same output), impact of LLM updates (behavioral changes across versions), and LLM replacement (consistency across different model families). The paper reports results from 59 experiments drawn from five recent studies that apply LLMs to KE tasks.

[Review]
- The paper is well written and clearly structured, making it easy to follow. The topic is also relevant and interesting for the community. Addressing the issues and suggestions outlined below would improve the overall quality of the work.

The related work section focuses mainly on defining terms and LLM consistency, but only briefly mentions one prior work on the consistency of LLMs across KE tasks. Given the growing number of studies comparing LLMs on the same KE inputs—often revealing performance differences—this seems insufficient. A broader review of such work would strengthen the paper.

In Section 3, the authors introduce n as the number of repeated runs for repeatability experiments. However, the value n = 3 is only explained in Section 5. For clarity, it would be better to state the value in Section 3 and refer the reader to Section 5 for its justification.

Section 4:
• The pass/fail criteria used to assess consistency are not clearly described—it’s unclear what constitutes a failure or why you selected those thresholds.
• Consider adding a column to Table 3 for the specific KE task, or ensuring it is consistently mentioned in the “Selected Studies” column—some studies (e.g., OntoGenia) are missing this detail.
• Consider moving Table 4 earlier in the paper to improve the flow and readability.

Section 7:
• p.16, line 48: ‘in nine out of ten cases’ – isn’t that 14/15 (93.3%) instead?
• I liked that, for some of the studies, you provided additional details on the rationale behind the inconsistencies with the original experiments. However, this level of explanation is not consistent across all studies. The paper would benefit from including similar justifications for the two studies where it is missing.

Minor Issues and Typos
• p.7, line 46: The word “being” is used twice – revise.
• p.8, line 7: “OntoMetrics” is misspelled.
• p.8, line 8: “qualitative evaluation by expert ontology engineers” — is this multiple experts or a single expert?
• p.8, line 19: Citation appears as ‘[?]’.
• Subsection titles in Section 4 end with periods (.), which is inconsistent with other sections — standardize formatting.
• p.10, line 17: The word “utilities” should be “utilizes”.
• p.16, line 47: ‘most prior KE results replicates’ is not very understandable. Consider rephrasing (e.g., ‘the results from prior KE studies’) for clarity.
• p.16, line 50: Some brackets are missing for ‘(T_{upi})’ and for ‘(T_{rpl})’.
• p.18, line 35: a space is missing in ‘however,the study’s…’

Review #2
By Anelia Kurteva submitted on 20/Aug/2025
Suggestion:
Major Revision
Review Comment:

The paper presents a novel framework (CoLLM) for assessing the consistency of LLMs in knowledge engineering tasks. Instead of relying on a broad definition of "consistency," CoLLM offers a concrete, three-part framework to evaluate it. I found the paper interesting to follow, but I think it can and should be improved a bit more before publication.

The introduction is well-written and effectively presents the background on the topic, motivating the work.

Table 1 is informative enough for the reader to distinguish between the different meanings of the terms. I would, however, remove either "reproducibility" or "reproducing": the only difference I see is that one is a noun and the other a verb. The definition of consistency could be refined to reflect specific processes.

It might be useful for the reader if the authors specify at the start of the related work that the section discusses the definitions of the various terms. At first glance, I would have expected an overview of technical tools, benchmarks or metrics. For example, the authors can elaborate a bit more on the approach presented in [27] (page 5, line 30).

I find the flow between Sections 2 and 3 a bit fractured. The methodology or approach for CoLLM is missing, as Section 2 presents mainly a discussion of definitions.

Figure 1 presents “Experimental methodology” as part of CoLLM, which is a bit unclear. Is this the methodology for deriving the framework or running it?

It would be nice to see the year of each LLM version in Table 2. Which is the most recent one, and which the oldest?

The authors mention that CoLLM consists of grounded tests, as presented in Fig. 1. However, the figure does not label the tests. The figure should better reflect that T_rep, T_upi and T_rpl are the tests.

In Table 4, the benchmark or baseline information is unclear. Do the authors mean 18 alignments, 18 triples or 18 CQs?

Page 10, line 48 – the last sentence is cut off by the following tables and gets mixed up with their captions, making it difficult to follow.

The future directions section could be elaborated further, with more technical details about each test, examples, etc. Are there any other concerns, and why exactly have these three been mentioned?

Minor comments:
Page 3, lines 1-3 – Semantic Web should be consistently capitalised.
Page 5, lines 17-19 – Usually, paragraphs have more than 1 sentence.
Section 3.1 can include a brief introduction to what is presented later in the subsections.
Issue with reference in Table 3 -> Retrofit-CQs.
Page 9, line 6 -> missing or misplaced reference for OntoGenia.
Section 6.1 can benefit from a brief introduction.
All tables should have a consistent font size.
The references should be double-checked for capitalisation of titles and publication venues (e.g., ref 14 “Iee” -> “IEEE”; [21,26…]).

Review #3
By Maria Angela Pellegrino submitted on 23/Aug/2025
Suggestion:
Major Revision
Review Comment:

This paper proposes CoLLM, a framework for assessing the consistency of Large Language Models (LLMs) in knowledge engineering. Consistency is evaluated through three tests: (i) the Repeatability Test, (ii) the Update Impact Test, and (iii) the Replacement Test. The authors conduct 59 experiments drawn from five recent studies in the literature. Results show that in over 80% of cases, outcomes are consistent, supporting the reliability of prior findings while highlighting cases of variability.

The contribution is timely and valuable, given the increasing role of LLMs in pipelines across knowledge engineering and related fields. The framework addresses an urgent need to test reproducibility, generalizability, and robustness under model updates and alternatives. However, certain aspects of presentation, methodology, and transparency could be strengthened to maximize the impact of this work.

Strengths
- Tackles an important and under-explored problem: assessing consistency of LLMs in KE tasks.
- Proposes a clear and structured framework (CoLLM) with three distinct tests that operationalize consistency.
- Provides an empirical evaluation across multiple studies and datasets, covering a variety of KE tasks.
- Makes the code available on GitHub, which is a step toward reproducibility. The GitHub repository includes a README file, which is adequate for orienting readers to the contents and understanding the structure of the data and code.

Weaknesses
Introduction and Framing
The introduction is engaging but overly “philosophical”; it delays discussion of the concrete contributions. Moreover, the concept of knowledge engineering tasks (ontology learning, ontology matching, etc.) is not introduced for non-expert readers, limiting accessibility.

Related Work / Background
The related work section is dominated by definitional discussions of reproducibility and consistency.
Actual comparisons with prior frameworks or evaluations of consistency are limited to a short final paragraph.
Coverage of related work is too narrow for a journal article; relevant studies from other fields (e.g., NLP reproducibility efforts, software engineering test frameworks) are missing.

Framework Description (CoLLM)
- The framework’s extensibility to alternative approaches or domains is not discussed. How the code is meant to be used is not described (even in the repository), reducing transparency.
- The rationale in the Replacement Test (“based on some rationale, such as being newer”) is vague and not reproducible.

Methodology and Study Selection
- The selection process for the five studies lacks transparency. Inclusion criteria are listed, but the retrieval and filtering process is unclear. Without a systematic approach (e.g., inspired by literature review guidelines), study selection may appear subjective.
- The coverage of KE tasks is not analyzed — are key areas missing?

Results and Presentation
- Table 5 might benefit from the inclusion of the total number of executed tests.
- The paper should explicitly explain how to interpret green and red values, ideally with an example.
- Inconsistencies in reported results: the abstract states 81.4% consistent cases, while the conclusion mentions 85%.

Minor Issues
- Table 3 includes a broken citation for the fifth study.
- All prompts in Table 4 should be moved to supplementary material for coherence.
- Formatting of tables is inconsistent; in particular, font sizes vary.

Suggestions for Improvement
- Restructure Introduction: After motivating the reproducibility challenge, move more quickly to outlining CoLLM and its contributions.
- Clarify Knowledge Engineering context: Briefly introduce KE tasks and their relevance for LLM-driven workflows, making the work accessible to a broader audience.
- Add a Background Section: Separate the definitional discussion of reproducibility and consistency into a dedicated background section. Leave “Related Work” for frameworks and prior studies (including other domains).
- Strengthen Study Selection Transparency: Either adopt a systematic selection method or explicitly acknowledge reliance on the authors’ expertise. Discuss coverage of KE tasks.
- Improve Framework Transparency: Provide clearer criteria for model replacement, and describe the usage of the CoLLM repository (how to run tests, extensibility).
- Refine Results Presentation: Add totals to Table 5, explain green/red markers with examples, and harmonize reported percentages across the paper.
- Fix Minor Issues reported above.

This paper makes an important and timely contribution to the Semantic Web and LLM research community.
The paper is original, presents sound and significant results, and is well written. The repository is appropriately hosted on GitHub and contains a README, with resources that are largely complete for replication.
With improvements in transparency (study selection, framework usage), clarity (introduction, tables), and coverage (related work, KE task diversity), it could become a strong and impactful journal article.