From GPT to Mistral: Cross-Domain Ontology Learning with NeOn-GPT

Tracking #: 3859-5073

Authors: 
Nadeen Fathallah
Arunav Das
Stefano De Giorgis
Andrea Poltronieri
Peter Haase
Liubov Kovriguina
Elena Simperl
Albert Meroño-Peñuela
Steffen Staab
Alsayed Algergawy

Responsible editor: 
Marta Sabou

Submission type: 
Full Paper
Abstract: 
We extend our previous work on NeOn-GPT, an LLM-powered ontology learning pipeline grounded in the NeOn methodology, by introducing methodological enhancements and broadening its evaluation across multiple domains and language models. We apply the pipeline to four diverse domains; for each domain, ontologies are generated with a proprietary model (GPT-4o) and an open-source model (Mistral). Evaluation is conducted against gold-standard ontologies using structural, lexical, and semantic metrics. Results demonstrate that LLMs can produce ontologies with high relational expressivity and partial conceptual alignment, though performance varies by domain and model.
Tags: 
Reviewed

Decision/Status: 
Major Revision

Solicited Reviews:
Review #1
Anonymous submitted on 19/Jul/2025
Suggestion:
Accept
Review Comment:

The paper presents a well-written follow-up to the authors' initial work on NeOn-GPT, extending their ontology engineering pipeline augmented by large language models. The approach demonstrates how ontology engineering can be partially automated through prompt (or context) engineering for the formalization of ontology components (classes, relations, individuals, and axioms). Particularly valuable is the extension of this work to a range of ontologies and the evaluation of two LLMs, one closed-source and one open-source. The exposition is clear, and the evaluation is illuminating.

The paper makes several important contributions: a clear demonstration of how LLM-based pipelines can support ontology engineering across diverse domains, a comparative analysis between proprietary (GPT-4o) and open-source (Mistral) models, a comprehensive evaluation framework incorporating structural, lexical, and semantic metrics, and a well-articulated limitations and detailed future work section. I appreciated the care taken with spelling and grammar in this article, as many articles that I review are not so clean in that regard. In addition, the GitHub repository of experimental resources provided with the paper is well-organized and appears to be complete, supporting reproducibility of the reported experimental results.

However, I have concerns about two fundamental aspects of the comparison between LLM-generated ontologies and the gold standard ontologies. First, it appears we are comparing the first pass of an LLM-generated ontology to "established expert-curated" gold standards that may have received considerable amounts of iterative refinement over time. While the statement in the discussion that the LLM outputs "serve as promising starting points" is borne out in the evaluations, it would be interesting to compare these outputs to some initial version of the gold standard ontologies, to make it more of an equal comparison. Second, it would be useful to see an evaluation that included multiple expert-curated ontologies for a given domain, to understand the issues of variation between those in conjunction with the variation between the GPT and Mistral ontologies, again, to provide a clearer comparison.

It would be valuable to see additional comments in the discussion section around the notion of iterative development of prompts and personae. Does this really expedite ontology development, or does it simply shift effort from traditional curation to iterative prompt engineering, particularly in the selection of examples for few-shot prompting? I would have liked to see more consideration of what a human/LLM hybrid process would be as a next step, building on the current automated pipeline approach.

A minor note: I found the visualizations of the class hierarchies in Section 6 impossible to read; hopefully in the final publication version this could be somehow addressed.

The limitations and next steps section are detailed and well-articulated. From a more philosophical perspective, it would be interesting for future work to consider the NeOn-GPT process from the perspective of social epistemology, as a hybrid, iterative, and social process of meaning making.

Despite the concerns raised above, I believe this article represents an important incremental step forward in approaching the use of LLMs in ontology engineering, and I believe it is a worthy contribution to the field.

Review #2
Anonymous submitted on 11/Oct/2025
Suggestion:
Major Revision
Review Comment:

*abstract
The abstract is clear and well structured but remains overly general. While it outlines the overall goal and setup effectively, it lacks sufficient methodological and contextual detail to assess novelty, rigor, and broader relevance.
- the “methodological enhancements” to NeOn-GPT are not explained, leaving it unclear which aspects were improved or newly introduced
- assumes readers are already familiar with the NeOn methodology
- “four diverse domains” is too general; indicating their types would clarify the claim of cross-domain applicability
- terms such as “high relational expressivity” and “partial conceptual alignment” are not defined or quantified
- the abstract does not make clear what is genuinely new in this extension relative to the earlier NeOn-GPT publication

*introduction
- it remains dense and assumes too much prior familiarity with the NeOn methodology and the previous NeOn-GPT paper
- several prior works on LLM-based ontology learning are cited, but the introduction does not articulate how the proposed approach specifically differs from or improves upon them; clarifying this would help establish the paper’s novelty and unique contribution

*related work
Section 2.3 provides good coverage of the literature but insufficient analytical depth. Explicitly contrasting NeOn-GPT with key prior work, particularly [33] and [42], would better substantiate the originality and necessity of the proposed pipeline. In particular, works [33] and [42] are mentioned but never discussed in terms of their methodological contributions or limitations, and it is unclear which aspects of these approaches are integrated into NeOn-GPT or why applying only those methods would yield different outcomes.
- The statement that “many of which are integrated into our broader NeOn-GPT pipeline” is too vague. Specify which components are reused, how they are adapted, and in what sense NeOn-GPT generalizes or unifies them.
- Without explicit differentiation, readers may perceive NeOn-GPT as a combination of existing ideas rather than a novel methodological framework. Adding a concise comparative paragraph would make the distinction explicit.
*methodology
To reach the standards of a mature methodological contribution, this section requires greater formalization, clarification of automation boundaries, explicit iteration logic, and transparent experimental configuration.
- The authors should distill the workflow into an algorithmic representation (e.g., pseudocode or a numbered procedure) summarizing control flow, key inputs, validation stages, and re-prompting logic (a minimal sketch of such a control loop is given at the end of these comments).
- Define stopping criteria and explain how conflicts among validation layers (syntax vs. consistency vs. pitfalls) are resolved.

- Several steps (persona refinement, ontology fragment selection, merging TTL outputs) appear manual. The methodology must explicitly indicate which components are automated, semi-automated, or human-curated. This distinction is essential for assessing scalability and reproducibility.

- While role-play, few-shot, and chain-of-thought prompting are mentioned, no representative prompt structures or templates are provided.
- The section should include generalized prompt templates (with placeholders) and explain how outputs from each stage feed subsequent prompts, as well as how token limits or truncation issues are handled.

- “Embedding ontology fragments as few-shot examples” is an interesting idea, but the selection, integration, and conflict-resolution procedures are not described. Specify:
  - how fragments are identified (manual curation, retrieval algorithm, or keyword matching)
  - how overlap or duplicate entities are managed
  - how reused IRIs are distinguished from newly created ones

- Claims such as “improved hierarchy construction” or “structured prompt engineering” remain qualitative.
- Provide quantitative indicators or criteria used to measure improvement (e.g., hierarchy depth, number of valid triples, error-reduction rate).

- Figure 2 is not mentioned in the text.
- Both figures (1 and 2) show the pipeline flow, but Figure 2 mostly repeats content from Figure 1 with minor structural differences.
- Merge them into a single comprehensive figure or clearly differentiate their roles.
- Put Turtle examples in listings to make the section clearer.
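
To make the first request above concrete, here is a minimal sketch (Python, not the authors' implementation) of the kind of generate-validate-re-prompt control loop the methodology describes; every callable and the iteration budget are hypothetical placeholders standing in for the paper's actual prompt construction, LLM calls, and validation layers.

```python
from typing import Callable, Dict, List

def learn_ontology(
    build_prompt: Callable[[Dict], str],            # persona + few-shot fragments (placeholder)
    llm_complete: Callable[[str], str],             # call to GPT-4o / Mistral (placeholder)
    validators: List[Callable[[str], List[str]]],   # e.g., syntax, consistency, pitfall checks
    build_repair_prompt: Callable[[str, List[str]], str],
    domain_spec: Dict,
    max_iterations: int = 3,                        # assumed stopping criterion
) -> str:
    """Generate a Turtle draft, validate it, and re-prompt with the error
    report until all validation layers pass or the iteration budget is spent."""
    prompt = build_prompt(domain_spec)
    draft = ""
    for _ in range(max_iterations):
        draft = llm_complete(prompt)
        issues = [msg for check in validators for msg in check(draft)]
        if not issues:
            return draft                            # accepted: every layer passed
        prompt = build_repair_prompt(draft, issues) # feed the error report back to the LLM
    return draft                                    # best effort after the budget is spent
```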

*Sections 4 and 5 should go together in one section
- Why were these specific structural, lexical, and semantic metrics selected?
- Who defined or validated the metric choices — the authors, domain experts, or a pre-existing standard?
- How were thresholds (e.g., 0.8 for lexical) determined? Were sensitivity analyses performed?
- Why was all-MiniLM-L6-v2 chosen as the embedding model for semantic similarity, and were alternatives tested? (A minimal sketch of such a similarity check, including a simple threshold sweep, follows this list.)
- Do these metrics originate from prior NeOn-GPT work or from independent ontology evaluation benchmarks?
- How do these evaluation dimensions align with the NeOn methodology stages (e.g., structural -> modeling, semantic -> conceptualization)?
- How do the gold-standard ontologies differ in scope and granularity, and how do these differences affect metric comparability?
- Are reasoner-based quality metrics (e.g., unsatisfiable classes, OOPS! pitfalls) used quantitatively in the evaluation, or only procedurally during validation?
- Have you considered including competency question (CQ) answerability or expert judgment as extrinsic validation methods?
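
To make the threshold and embedding-model questions above concrete, here is a minimal sketch of the kind of label-level semantic similarity check that all-MiniLM-L6-v2 supports, together with a simple threshold sweep of the sort a sensitivity analysis could report; the class labels are invented placeholders, and 0.8 is used only as an example cut-off.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

generated_labels = ["WaterSample", "MeasurementSite", "Pollutant"]      # hypothetical
gold_labels = ["Sample", "SamplingSite", "Contaminant", "Sensor"]       # hypothetical

gen_emb = model.encode(generated_labels, convert_to_tensor=True)
gold_emb = model.encode(gold_labels, convert_to_tensor=True)

# Best-matching gold class for each generated class, by cosine similarity.
best = util.cos_sim(gen_emb, gold_emb).max(dim=1).values

# Simple sensitivity sweep over the matching threshold.
for threshold in (0.6, 0.7, 0.8, 0.9):
    matched = int((best >= threshold).sum())
    print(f"threshold={threshold:.1f}: {matched}/{len(generated_labels)} classes matched")
```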

Suggestions:
- Add a short “Metric Justification” subsection explaining why each metric type was chosen and citing prior evaluation frameworks.
- Include a brief comparison of evaluation approaches in recent LLM-based ontology learning papers.
- Report verification statistics (e.g., number of syntax/consistency/pitfall errors before vs. after refinement).
- Add a sensitivity analysis for threshold values.
- Normalize results by ontology size or class count to make cross-domain comparisons fair.
- Incorporate a small expert-based or CQ-based validation to complement automated metrics.

*Section 6
What quantitative baseline are the results compared against?
How were the domain differences controlled?
How should readers interpret “high relational expressivity” and “partial conceptual alignment” in numerical terms?
What is the statistical reliability of the results? Were multiple runs made per model/domain, or are results from a single generation instance? Are variations across runs (due to stochasticity in LLMs) reported?
Are improvements statistically significant or just descriptive?
How do GPT-4o and Mistral differ qualitatively? What characteristics of each model (context length, reasoning, fine-tuning) explain the observed gaps?
Were any failure cases analyzed? For instance, types of ontology elements most prone to errors (missing axioms, incorrect domains/ranges).
What was the impact of reuse mechanisms on the results? Do reused fragments measurably improve lexical or semantic similarity?
Are there evaluation inconsistencies between the four domains (e.g., gold-standard completeness or differing CQ coverage)?
Does the verification stage (syntax / consistency / pitfall repair) quantitatively reduce error counts? If yes, how large is the reduction, and how many re-prompting iterations were typically needed?
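
As an illustration of the quantitative evidence the last question asks for, here is a minimal sketch that tallies syntax failures across successive pipeline outputs; the file names are hypothetical, and consistency and pitfall counts would additionally require an OWL reasoner and the OOPS! service, which are omitted here.

```python
from rdflib import Graph

def fails_to_parse(path: str) -> bool:
    """True if the Turtle file raises an error when parsed."""
    try:
        Graph().parse(path, format="turtle")
        return False
    except Exception:
        return True

# Hypothetical outputs from successive re-prompting iterations.
iterations = ["draft_iter0.ttl", "draft_iter1.ttl", "draft_final.ttl"]
failures = {path: fails_to_parse(path) for path in iterations}
print("syntax failures per iteration:", failures)
```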

*Sections 7 and 9 go together in one section; treating them jointly is not only more concise but also improves coherence by linking interpretation directly to critical reflection.
Sections 7 and 9 remain largely descriptive and do not clearly explain why certain patterns emerge or how the proposed methodological enhancements contribute to them. The authors mention variability across domains and models but provide limited analytical reflection on its causes or implications. Moreover, key aspects such as reproducibility, stochastic behavior, human intervention cost, and possible bias from LLM pretraining remain unaddressed.
- Which specific methodological enhancements (prompt engineering, reuse, validation loop) most contributed to the observed gains, and how was this determined?
- How robust are these findings across different domains and LLMs?
- What are the main failure modes identified (e.g., hallucinations, shallow hierarchies, incorrect axioms), and how do they align with the quantitative evaluation?
- How reproducible are the outcomes given the stochastic nature of LLMs? Were multiple runs averaged or manually filtered?
- To what extent do the verification steps (syntax, consistency, pitfalls) mitigate but not eliminate systemic issues?
- What is the human effort cost (persona design, ontology fragment selection, output correction), and does it limit scalability?
- Are there risks of ontology drift or bias introduced by the LLM’s pretraining data, and how are these mitigated?
- Could the authors discuss ethical or epistemic considerations (e.g., reproducibility, attribution, ontology ownership) in automated knowledge generation?
- How might these limitations affect real-world deployment, especially in sensitive domains such as biomedical or environmental knowledge?
- What concrete future steps are planned to address the identified issues (e.g., integration with active learning, fine-tuning, or human–LLM co-design)?

*Sections 8 and 10
- Combine Sections 8 and 10 into a unified “Conclusions and Future Directions” section that blends reflection and projection.

Review #3
Anonymous submitted on 31/Oct/2025
Suggestion:
Major Revision
Review Comment:

This paper extends previous work on NeOn-GPT, an LLM-powered ontology learning pipeline based on the NeOn methodology. The authors introduce methodological enhancements and a broader empirical evaluation across multiple domains and language models (GPT-4o and Mistral). The goal is to demonstrate that LLMs can produce syntactically valid and semantically coherent ontologies.

While the topic is timely and relevant to the Semantic Web community, several critical issues remain concerning methodological choices, evaluation design, reproducibility, and the paper’s positioning with respect to existing literature.

## Comments

1. Choice of Methodology (NeOn vs. LOT): My first concern is the rationale for using NeOn rather than LOT. The LOT methodology was designed for industrial and lightweight ontology engineering processes and would arguably provide a more realistic test of whether LLMs can facilitate ontology creation in practice. NeOn, on the other hand, is complex and heavy-weight, better suited for large-scale, structured projects with human oversight. The authors should justify why NeOn was chosen and evaluate whether LOT would fit better with an automated or semi-automated LLM-driven process.

2. Reproducibility and Model Selection: A serious issue concerns reproducibility. The paper evaluates the pipeline using GPT-4o, a proprietary model that is no longer accessible. This makes the reported results non-reproducible and non-verifiable by other researchers.
If the goal is to assess the general capability of LLMs for ontology generation, the experiments should rely primarily on open-source models (e.g., Llama 3, DeepSeek, Mistral, Gemma, or GPT-OSS variants) and ideally include more than two. Comparing at least three or four open models would provide stronger, reproducible evidence of performance differences and model-agnostic behavior. At the very least, the authors should discuss the limitations of relying on a closed model and clarify whether their prompts, outputs, and evaluation scripts can be fully reproduced with open models.

3. Related Work Section: Much of the related work reads as general background rather than a critical positioning. The article should clearly state how NeOn-GPT differs from existing ontology generation frameworks. Without a comparative discussion of methods and contributions, it is hard to assess novelty.

4. Methodology and Role of LLMs: While the methodology section carefully follows NeOn’s phases, it is not clear why an LLM is required for each. For example, ontology implementation could be performed automatically via scripts or mapping tools (e.g., RML) without prompting an LLM; transforming a table of subject-predicate-object triples into OWL is not very hard (see the sketch below). The current pipeline remains very manual: most stages depend on few-shot prompting and domain-specific examples that must be crafted by an expert. It is possible that a more advanced agent-based or hybrid architecture (combining symbolic and data-driven components) could achieve the same goals more efficiently.
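
As a small illustration of the script-based alternative mentioned above, here is a minimal sketch using rdflib that turns a subject-predicate-object table into OWL/Turtle without any LLM call; the file name, column names, and namespace are invented for illustration.

```python
import csv
from rdflib import Graph, Namespace, RDF, RDFS, OWL

EX = Namespace("https://example.org/onto#")   # hypothetical namespace

g = Graph()
g.bind("ex", EX)
g.bind("owl", OWL)

# Hypothetical triples.csv with columns: subject,predicate,object
with open("triples.csv") as f:
    for row in csv.DictReader(f):
        s = EX[row["subject"].strip()]
        o = EX[row["object"].strip()]
        if row["predicate"].strip() == "subClassOf":
            g.add((s, RDF.type, OWL.Class))      # declare both ends as classes
            g.add((o, RDF.type, OWL.Class))
            g.add((s, RDFS.subClassOf, o))
        else:
            p = EX[row["predicate"].strip()]
            g.add((p, RDF.type, OWL.ObjectProperty))
            g.add((s, p, o))

print(g.serialize(format="turtle"))
```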

5. Actual Utility for Knowledge Engineers: It remains unclear how much this approach helps ontology engineers in practice. Beyond acting as an assistant, there is no quantitative evidence that the process reduces effort, errors, or development time. A user study measuring the time and quality impact on ontology developers, both in simple and complex domains, would greatly strengthen the claims.

6. Independence from Previous Work: Although the paper lists the differences with the prior NeOn-GPT publication, this new version cannot be fully understood without reading the original one. The added contributions should be explicitly summarized and contextualized to make the paper self-contained.

7. Figures, Tables, and References: Several figure and table references appear incorrect or inconsistent. Cross-referencing should be verified carefully.

8. Results and Analysis: The results section mostly restates the figures and tables rather than analyzing them. The discussion could be more concise and focus on interpretation, trends, and implications. Many detailed results could be moved to an appendix. Instead of exhaustive reporting, consider summarizing findings using aggregated statistics or visualization metrics.
Although the paper employs numerous structural, lexical, and semantic metrics, there is no composite measure summarizing overall alignment with the gold standard. Introducing an aggregated metric such as an ontology-level similarity, would simplify interpretation.
Additionally, the metrics should be justified with references to prior ontology evaluation frameworks to ensure methodological rigor.
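
As one possible shape for such an aggregate, here is a toy sketch of an ontology-level similarity defined as a weighted mean of the three reported dimensions; the weights and scores are invented, not taken from the paper.

```python
def ontology_similarity(structural: float, lexical: float, semantic: float,
                        weights=(1/3, 1/3, 1/3)) -> float:
    """Weighted mean of the per-dimension alignment scores."""
    return sum(w * s for w, s in zip(weights, (structural, lexical, semantic)))

print(round(ontology_similarity(structural=0.72, lexical=0.55, semantic=0.63), 3))  # 0.633
```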

9. Ablation Study: An ablation study is missing. It would be valuable to analyze the contribution of each methodological step (e.g., reuse, validation, enrichment) to the final ontology quality, and to examine performance when individual components are removed or modified.

10. Conclusions and Structure: The paper’s final sections (7–10) could be streamlined. The conclusions correctly acknowledge that the system functions more as an assistant requiring expert intervention. However, the key question remains unanswered: *How much does this approach accelerate or simplify ontology construction for real practitioners?* This is central to assessing the impact of the proposed approach.

Given the issues above, I recommend a major revision before publication.