Review Comment:
This paper proposes CoLLM, a framework for assessing the consistency of Large Language Models (LLMs) in knowledge engineering. Consistency is evaluated through three tests: (i) the Repeatability Test, (ii) the Update Impact Test, and (iii) the Replacement Test. The authors conduct 59 experiments drawn from five recent studies in the literature. Results show that in over 80% of cases, outcomes are consistent, supporting the reliability of prior findings while highlighting cases of variability.
The contribution is timely and valuable, given the increasing role of LLMs in pipelines across knowledge engineering and related fields. The framework addresses an urgent need to test reproducibility, generalizability, and robustness under model updates and alternatives. However, certain aspects of presentation, methodology, and transparency could be strengthened to maximize the impact of this work.
Strengths
- Tackles an important and under-explored problem: assessing consistency of LLMs in KE tasks.
- Proposes a clear and structured framework (CoLLM) with three distinct tests that operationalize consistency.
- Provides an empirical evaluation across multiple studies and datasets, covering a variety of KE tasks.
- Makes the code available on GitHub, a step toward reproducibility. The repository includes a README file that adequately orients readers to the contents and the structure of the data and code.
Weaknesses
Introduction and Framing
- The introduction is engaging but overly “philosophical”, delaying discussion of the concrete contributions.
- Knowledge engineering tasks (ontology learning, ontology matching, etc.) are not introduced for non-expert readers, limiting accessibility.
Related Work / Background
- The related work section is dominated by definitional discussions of reproducibility and consistency.
- Actual comparisons with prior frameworks or evaluations of consistency are limited to a short final paragraph.
- Coverage of related work is too narrow for a journal article; relevant studies from other fields (e.g., NLP reproducibility efforts, software engineering test frameworks) are missing.
Framework Description (CoLLM)
- The framework’s extensibility to alternative approaches or domains is not discussed. How to run and use the code is not described (even in the repository), which reduces transparency.
- The rationale in the Replacement Test (“based on some rationale, such as being newer”) is vague and not reproducible.
Methodology and Study Selection
- The selection process for the five studies lacks transparency. Inclusion criteria are listed, but the retrieval and filtering process is unclear. Without a systematic approach (e.g., inspired by literature review guidelines), study selection may appear subjective.
- The coverage of KE tasks is not analyzed — are key areas missing?
Results and Presentation
- Table 5 would benefit from including the total number of executed tests.
- The paper should explicitly explain how to interpret green and red values, ideally with an example.
- Inconsistencies in reported results: the abstract states 81.4% consistent cases, while the conclusion mentions 85%.
Minor Issues
- Table 3 includes a broken citation for the fifth study.
- All prompts in Table 4 should be moved to supplementary material for coherence.
- Table formatting is inconsistent, mainly in that sizes vary.
Suggestions for Improvement
- Restructure Introduction: After motivating the reproducibility challenge, move more quickly to outlining CoLLM and its contributions.
- Clarify Knowledge Engineering context: Briefly introduce KE tasks and their relevance for LLM-driven workflows, making the work accessible to a broader audience.
- Add a Background Section: Separate the definitional discussion of reproducibility and consistency into a dedicated background section. Leave “Related Work” for frameworks and prior studies (including other domains).
- Strengthen Study Selection Transparency: Either adopt a systematic selection method or explicitly acknowledge reliance on the authors’ expertise. Discuss coverage of KE tasks.
- Improve Framework Transparency: Provide clearer criteria for model replacement, and describe the usage of the CoLLM repository (how to run tests, extensibility).
- Refine Results Presentation: Add totals to Table 5, explain green/red markers with examples, and harmonize reported percentages across the paper.
- Fix Minor Issues reported above.
This paper makes an important and timely contribution to the Semantic Web and LLM research community.
The paper is original, presents sound and significant results, and is well written. The repository is appropriately hosted on GitHub and contains a README, with resources that are largely complete for replication.
With improvements in transparency (study selection, framework usage), clarity (introduction, tables), and coverage (related work, KE task diversity), it could become a strong and impactful journal article.