Review Comment:
*Abstract
The abstract is clear and well structured but remains overly general. While it outlines the overall goal and setup effectively, it lacks sufficient methodological and contextual detail to assess novelty, rigor, and broader relevance.
- “Methodological enhancements” to NeOn-GPT are not explained, leaving it unclear which aspects were improved or newly introduced.
- The abstract assumes readers are already familiar with the NeOn methodology.
- “Four diverse domains” is too general; indicating their types would clarify the claim of cross-domain applicability.
- terms such as “high relational expressivity” and “partial conceptual alignment” are not defined or quantified
- the abstract does not make clear what is genuinely new in this extension relative to the earlier NeOn-GPT publication
*Introduction
- The introduction remains dense and assumes too much prior familiarity with the NeOn methodology and the previous NeOn-GPT paper.
- Several prior works on LLM-based ontology learning are cited, but the introduction does not articulate how the proposed approach specifically differs from or improves upon them. Clarifying this would help establish the paper’s novelty and unique contribution.
*Related work
Section 2.3 provides good coverage of the literature but insufficient analytical depth. Explicitly contrasting NeOn-GPT with key prior work, particularly [33] and [42], would better substantiate the originality and necessity of the proposed pipeline.
- Works [33] and [42] are mentioned but never discussed in terms of their methodological contributions or limitations. It is unclear what aspects of these approaches are integrated into NeOn-GPT or why applying only those methods would yield different outcomes.
- The statement that “many of which are integrated into our broader NeOn-GPT pipeline” is too vague. Specify which components are reused, how they are adapted, and in what sense NeOn-GPT generalizes or unifies them.
- Without explicit differentiation, readers may perceive NeOn-GPT as a combination of existing ideas rather than a novel methodological framework. Adding a concise comparative paragraph would address this.
*Methodology
To reach the standards of a mature methodological contribution, this section requires greater formalization, clarification of automation boundaries, explicit iteration logic, and transparent experimental configuration.
- The authors should distill the workflow into an algorithmic representation (e.g., pseudocode or a numbered procedure) summarizing control flow, key inputs, validation stages, and re-prompting logic; a sketch of the expected level of detail is given at the end of this section.
- Define stopping criteria and explain how conflicts among validation layers (syntax vs. consistency vs. pitfalls) are resolved.
- Several steps (persona refinement, ontology fragment selection, merging TTL outputs) appear manual. The methodology must explicitly indicate which components are automated, semi-automated, or human-curated. This distinction is essential for assessing scalability and reproducibility.
- While role-play, few-shot, and chain-of-thought prompting are mentioned, no representative prompt structures or templates are provided.
- The section should include generalized prompt templates (with placeholders) and explain how outputs from each stage feed subsequent prompts, as well as how token limits or truncation issues are handled; a hypothetical template sketch is also given at the end of this section.
- “Embedding ontology fragments as few-shot examples” is an interesting idea, but the selection, integration, and conflict-resolution procedures are not described. Specify:
  - how fragments are identified (manual curation, a retrieval algorithm, or keyword matching);
  - how overlap or duplicate entities are managed;
  - how reused IRIs are distinguished from newly created ones.
- Claims such as “improved hierarchy construction” or “structured prompt engineering” remain qualitative. Provide quantitative indicators or criteria used to measure improvement (e.g., hierarchy depth, number of valid triples, error-reduction rate).
- Figure 2 is not mentioned in the text.
- Both figures (1 and 2) show the pipeline flow, but Figure 2 mostly repeats content from Figure 1 with minor structural differences.
- Merge them into a single comprehensive figure or clearly differentiate their roles.
- Put the Turtle examples in listings to make the section clearer.
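For concreteness, the kind of algorithmic summary requested in the first point above could look roughly like the sketch below. This is a minimal illustration of the generate-validate-re-prompt loop as the reviewer understands it from the text; every function passed in, the fixed validation order, and the iteration budget are hypothetical placeholders, not the authors’ actual implementation.

```python
# Hypothetical sketch of the control flow the paper should make explicit.
# All callables passed in (generator, validators, repair-prompt builder) are placeholders.

from typing import Callable, List


def generate_and_repair(
    generate: Callable[[str], str],                 # LLM call: prompt -> Turtle serialization
    validators: List[Callable[[str], List[str]]],   # e.g. syntax, consistency, pitfall checks
    build_repair_prompt: Callable[[str, List[str]], str],
    initial_prompt: str,
    max_iterations: int = 3,                        # explicit stopping criterion
) -> str:
    """Generate an ontology, then validate and re-prompt until clean or budget exhausted."""
    prompt = initial_prompt
    ontology = generate(prompt)
    for _ in range(max_iterations):
        errors: List[str] = []
        for validate in validators:          # fixed order resolves conflicts: syntax first,
            errors = validate(ontology)      # then consistency, then pitfalls
            if errors:
                break                        # repair the earliest failing layer first
        if not errors:
            return ontology                  # all validation layers passed
        prompt = build_repair_prompt(ontology, errors)
        ontology = generate(prompt)
    return ontology                          # returned with residual errors after budget
```

Even this level of detail would make the stopping criteria and the conflict-resolution policy across validation layers explicit.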
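Similarly, the “generalized prompt templates (with placeholders)” could be reported in roughly the following form; the field names and wording are invented purely for illustration and should be replaced by the templates actually used in the pipeline.

```python
# Hypothetical template combining role-play, few-shot, and chain-of-thought elements;
# the placeholders ({persona}, {domain}, {fragment_ttl}, {competency_questions})
# are illustrative, not taken from the paper.
PROMPT_TEMPLATE = """\
You are {persona}, an expert ontology engineer for the {domain} domain.

Reused ontology fragment (few-shot example, Turtle):
{fragment_ttl}

Competency questions to cover:
{competency_questions}

Think step by step: first list the core classes, then the object and data
properties with their domains and ranges, then the class hierarchy.
Output only valid Turtle.
"""

# Example instantiation with clearly placeholder values.
prompt = PROMPT_TEMPLATE.format(
    persona="a senior domain expert",
    domain="<target domain>",
    fragment_ttl="<reused fragment in Turtle>",
    competency_questions="<CQs from the specification stage>",
)
print(prompt)
```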
*Sections 4 and 5 go together in one section
- Why were these specific structural, lexical, and semantic metrics selected?
- Who defined or validated the metric choices: the authors, domain experts, or a pre-existing standard?
- How were the thresholds (e.g., 0.8 for lexical similarity) determined? Were sensitivity analyses performed?
- Why was all-MiniLM-L6-v2 chosen as the embedding model for semantic similarity, and were alternatives tested? (A sketch of what such a comparison could look like follows these questions.)
- Do these metrics originate from prior NeOn-GPT work or from independent ontology evaluation benchmarks?
- How do these evaluation dimensions align with the NeOn methodology stages (e.g., structural -> modeling, semantic -> conceptualization)?
- How do the gold-standard ontologies differ in scope and granularity, and how do these differences affect metric comparability?
- Are reasoner-based quality metrics (e.g., unsatisfiable classes, OOPS! pitfalls) used quantitatively in the evaluation, or only procedurally during validation?
- Have you considered including competency question (CQ) answerability or expert judgment as extrinsic validation methods?
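For concreteness, the embedding-model comparison asked about above could be reported along the following lines. This is a minimal sketch using the sentence-transformers library; the concept labels, the alternative model, and the 0.8 threshold are illustrative assumptions rather than the paper’s configuration.

```python
# Minimal sketch: semantic matching of generated vs. gold-standard concept labels
# under two candidate embedding models. All labels below are invented examples.
from sentence_transformers import SentenceTransformer, util

generated = ["Author", "ResearchArticle", "Conference"]   # labels from the generated ontology
gold = ["Person", "Publication", "Event"]                  # labels from the gold standard

for model_name in ("all-MiniLM-L6-v2", "all-mpnet-base-v2"):   # candidate embedding models
    model = SentenceTransformer(model_name)
    sims = util.cos_sim(model.encode(generated), model.encode(gold))
    # Count a generated label as matched if its best gold counterpart exceeds the threshold.
    matched = sum(1 for row in sims if float(row.max()) >= 0.8)
    print(f"{model_name}: {matched}/{len(generated)} labels matched at threshold 0.8")
```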
Suggestions:
- Add a short “Metric Justification” subsection explaining why each metric type was chosen and citing prior evaluation frameworks.
- Include a brief comparison of evaluation approaches used in recent LLM-based ontology learning papers.
- Report verification statistics (e.g., number of syntax/consistency/pitfall errors before vs. after refinement).
- Add a sensitivity analysis for the threshold values (see the sketch after this list).
- Normalize results by ontology size or class count to make cross-domain comparisons fair.
- Incorporate a small expert-based or CQ-based validation to complement the automated metrics.
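As a concrete illustration of the sensitivity and normalization suggestions, a threshold sweep combined with size normalization could be as simple as the sketch below; the per-domain similarity scores, class counts, and the 0.6-0.9 range are invented for illustration.

```python
# Hypothetical sketch of a threshold sensitivity analysis with size normalization.
# best_similarities holds, per domain, the best gold-match score of each generated class;
# class_counts holds the gold-standard class count per domain. All numbers are invented.
best_similarities = {
    "domain_A": [0.92, 0.81, 0.77, 0.64],
    "domain_B": [0.88, 0.85, 0.59],
}
class_counts = {"domain_A": 40, "domain_B": 25}

for threshold in (0.6, 0.7, 0.8, 0.9):               # sensitivity sweep over the match threshold
    for domain, scores in best_similarities.items():
        matched = sum(s >= threshold for s in scores)
        coverage = matched / class_counts[domain]      # normalize by gold-standard class count
        print(f"{domain} @ {threshold:.1f}: {matched} matches, coverage {coverage:.2%}")
```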
*Section 6
- What quantitative baseline are the results compared against?
- How were differences between domains controlled for?
- How should readers interpret “high relational expressivity” and “partial conceptual alignment” in numerical terms?
- What is the statistical reliability of the results? Were multiple runs made per model/domain, or are the results from a single generation instance? Are variations across runs (due to the stochasticity of LLMs) reported? (A minimal reporting sketch follows these questions.)
- Are improvements statistically significant or only descriptive?
- How do GPT-4o and Mistral differ qualitatively? What characteristics of each model (context length, reasoning, fine-tuning) explain the observed gaps?
- Were any failure cases analyzed, for instance the types of ontology elements most prone to errors (missing axioms, incorrect domains/ranges)?
- What was the impact of the reuse mechanisms on the results? Do reused fragments measurably improve lexical or semantic similarity?
- Are there evaluation inconsistencies between the four domains (e.g., gold-standard completeness or differing CQ coverage)?
- Does the verification stage (syntax / consistency / pitfall repair) quantitatively reduce error counts? If so, how large is the reduction, and how many re-prompting iterations were typically needed?
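To make the reliability questions concrete, per-model scores from repeated generation runs could be summarized and compared roughly as in the sketch below; the run counts, scores, and the choice of a Mann-Whitney U test are assumptions for illustration, not the authors’ protocol.

```python
# Hypothetical sketch: aggregate repeated runs per model and check whether the
# observed gap is more than descriptive. All scores below are invented.
import numpy as np
from scipy.stats import mannwhitneyu

scores_gpt4o = np.array([0.71, 0.74, 0.69, 0.73, 0.72])     # e.g. semantic similarity over 5 runs
scores_mistral = np.array([0.63, 0.66, 0.61, 0.65, 0.64])

print(f"GPT-4o:  mean={scores_gpt4o.mean():.3f}, std={scores_gpt4o.std(ddof=1):.3f}")
print(f"Mistral: mean={scores_mistral.mean():.3f}, std={scores_mistral.std(ddof=1):.3f}")

# Non-parametric test over independent runs; report the p-value alongside the means.
stat, p_value = mannwhitneyu(scores_gpt4o, scores_mistral)
print(f"Mann-Whitney U: statistic={stat:.2f}, p={p_value:.4f}")
```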
*Sections 7 and 9 go together in one section; treating them jointly is not only more concise but also improves coherence by linking interpretation directly to critical reflection.
Sections 7 and 9 remain largely descriptive and do not clearly explain why certain patterns emerge or how the proposed methodological enhancements contribute to them. The authors mention variability across domains and models but provide limited analytical reflection on its causes or implications. Moreover, key aspects such as reproducibility, stochastic behavior, human intervention cost, and possible bias from LLM pretraining remain unaddressed.
- Which specific methodological enhancements (prompt engineering, reuse, validation loop) contributed most to the observed gains, and how was this determined?
- How robust are these findings across different domains and LLMs?
- What are the main failure modes identified (e.g., hallucinations, shallow hierarchies, incorrect axioms), and how do they align with the quantitative evaluation?
- How reproducible are the outcomes given the stochastic nature of LLMs? Were multiple runs averaged or manually filtered?
- To what extent do the verification steps (syntax, consistency, pitfalls) mitigate but not eliminate systemic issues?
- What is the human effort cost (persona design, ontology fragment selection, output correction), and does it limit scalability?
- Are there risks of ontology drift or bias introduced by the LLM’s pretraining data, and how are these mitigated?
- Could the authors discuss ethical or epistemic considerations (e.g., reproducibility, attribution, ontology ownership) in automated knowledge generation?
- How might these limitations affect real-world deployment, especially in sensitive domains such as biomedical or environmental knowledge?
- What concrete future steps are planned to address the identified issues (e.g., integration with active learning, fine-tuning, or human–LLM co-design)?
*Sections 8 and 10
- Combine Sections 8 and 10 into a unified “Conclusions and Future Directions” section that blends reflection and projection.