Ontology-driven context engineering for semantically-aware chatbot responses

Tracking #: 4002-5216

Authors: 
Mateus Peixoto
Gabriel Ferreira
Lucas Gomes Maddalena
Fernanda Araujo Baião

Responsible editor: 
Guest Editors ML and KR 2025

Submission type: 
Full Paper
Abstract: 
Large Language Models (LLMs) have achieved notable success across diverse domains, yet they remain prone to hallucinations and omissions due to their probabilistic reasoning and black-box nature. These limitations are particularly critical in high-stakes domains, where inaccurate or incomplete responses may lead to severe consequences, for example, failing to instruct a client correctly in an account activation procedure despite having access to the information. Recent efforts to mitigate such issues have focused on integrating structured knowledge into LLM pipelines, predominantly through knowledge graphs. However, the selection of the relevant entities to generate a response may not be intuitive to retrieve from the graph's structure alone. The role of deeper semantic structures has not been systematically explored. Specifically, ontologies with axiomatic foundations could be used for automated inference, leading to a more precise set of entities, and therefore potentially providing a better input to generate more accurate LLM responses. This paper investigates the impact of deep semantic enrichment on LLM-based knowledge extraction and response generation. We situate our contribution within the Neuro-Symbolic AI paradigm and propose a framework that combines knowledge graphs for contextual storage with ontological reasoning grounded in structural patterns that are ontologically well-founded in UFO. By leveraging the well-founded semantic nature of ontological entities, rather than solely their graph-like structure, the proposed approach constrains interpretation and supports more precise reasoning during prompt construction. The framework is evaluated by a real company using a domain ontology of the Brazilian financial system, enriched with business rules and processes provided by an industry partner. 
We compare scenarios with varying strategies for leveraging semantic knowledge from the ontology, including light-semantics approaches that rely only on graph-based properties as well as a problem-based baseline designed by domain experts. Experimental results demonstrate that the Deep Semantics approach achieves superior performance in both entity extraction and response generation quality, consistently extracting a higher number of relevant entities and producing more complete and semantically aligned responses. These findings highlight the benefits of incorporating ontological depth and axiomatic semantics into LLM workflows. Overall, this work provides empirical evidence that deeper knowledge representations can significantly enhance LLM reliability and interpreting ability, advancing the development of knowledge-driven large language models.
Tags: 
Reviewed

Decision/Status: 
Minor Revision

Solicited Reviews:
Review #1
By Christophe Cruz submitted on 03/Feb/2026
Suggestion:
Minor Revision
Review Comment:

Review: Ontology-Driven Context Engineering for Semantically Enhanced Chatbot Responses

Summary

This paper presents a framework for the semantic enrichment of chatbot responses using an OntoUML ontology from the financial domain. The proposed approach consists of nine components aimed at improving the quality and accuracy of bot-generated answers. The authors evaluate their system using eight different linking approaches and measure performance based on completeness, correctness, and False Positive Rate.

Strengths

The paper addresses a highly relevant and timely topic. The use of an OntoUML ontology to semantically enhance chatbot responses is an innovative approach that demonstrates significant potential for improving conversational AI systems.

The authors clearly articulate the challenges motivating this research, including problem comprehension limitations, hallucination issues, complex error handling, operational rule enforcement, and mistaken response repetition. This thorough problem statement effectively contextualizes the proposed solution.

The methodological rigor is commendable. By presenting eight distinct approaches for the entity extraction and linking process, the authors provide a comprehensive comparative analysis. The evaluation metrics, namely completeness, correctness, and False Positive Rate, offer a multidimensional assessment of system quality. The results suggest that deep semantics outperforms light semantics approaches when considering the trade-off between correctness and False Positive Rate.

Weaknesses and Required Revisions

Major Issues

Regarding the prompt generation process, the paper describes a framework with nine components; however, the prompt generation process is not adequately discussed. This omission leaves a gap in understanding the complete pipeline.

Concerning human expert involvement, the critical role of human specialists in qualifying the dataset should be explicitly acknowledged, as this significantly impacts the results. Additionally, the OntoUML enrichment process appears to be time-consuming for domain experts, which raises concerns about scalability and practical deployment.

The section on Foundation Ontology Patterns in OntoUML is somewhat abstruse, making it difficult to understand and to appreciate its relevance to the presented work. This section requires clarification and improved exposition.

The reported False Positive Rate values exceeding 100% are confusing and appear inconsistent with the standard definition of False Positive Rate, which should range between 0% and 100%. This requires clarification or correction in the results discussion.
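To make the concern concrete, the following minimal sketch contrasts the standard definition, FPR = FP / (FP + TN), with one hypothetical ratio that could produce values above 100%. The second function, its name, and all counts are assumptions for illustration only; the manuscript's actual formula is not known here.

```python
def false_positive_rate(fp: int, tn: int) -> float:
    """Standard FPR = FP / (FP + TN); the result always lies in [0, 1]."""
    if fp < 0 or tn < 0:
        raise ValueError("counts must be non-negative")
    total_negatives = fp + tn
    # With no negatives at all the rate is undefined; 0.0 is returned by convention here.
    return fp / total_negatives if total_negatives else 0.0


def false_positives_per_reference(fp: int, n_reference: int) -> float:
    """Hypothetical ratio FP / |reference entities| (an assumption, not the paper's
    definition). It has no upper bound, so it can exceed 1.0 (i.e. 100%)."""
    if fp < 0 or n_reference <= 0:
        raise ValueError("fp must be non-negative and n_reference positive")
    return fp / n_reference


# Illustrative counts only; they are not taken from the manuscript.
fp, tn, n_reference = 6, 14, 4
standard = false_positive_rate(fp, tn)                        # 6 / 20
alternative = false_positives_per_reference(fp, n_reference)  # 6 / 4
print(f"standard FPR: {standard:.0%}, FP per reference entity: {alternative:.0%}")
```

If the reported values follow a ratio of the second kind, a different metric name would avoid the confusion with the standard False Positive Rate.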

The ranking algorithm used in Table 2 is not explained, making it difficult to interpret the relevance and validity of the presented rankings across different sessions.

Minor Issues

The acronym DROP is used but never defined. Figure 3.b appears to be empty or missing content. A reference is made to Section 2, but no section numbering is present in the manuscript.

Recommendation

Accept with minor revisions.

Overall, this paper makes a valuable contribution to the field of semantically enhanced conversational agents. The research is well-motivated, methodologically sound, and presents interesting findings regarding the superiority of deep semantic approaches. However, the issues outlined above, particularly the clarification of the False Positive Rate values, the ranking methodology, and the Foundation Ontology Patterns section, must be addressed before publication.

Review #2
By SCHAEFFER Marion submitted on 18/Mar/2026
Suggestion:
Minor Revision
Review Comment:

This paper addresses a very interesting topic: the use of ontologies to provide context for chatbots in question answering applications. The paper is very well written and addresses a topic that has received little attention in the literature. It thus seeks to demonstrate the advantages of ontologies over knowledge graphs, thanks to their rich semantic structure, when combined with LLMs.

Relevant material is provided in a ZIP folder. The folder is well documented. To ensure the reproducibility of the paper's results, I suggest paying attention to code clarity, adding code for the chatbot and the deep semantics component, and hosting the code in a publicly available repository.

The state of the art on ontology and knowledge graphs is highly relevant. The differences between them are highlighted, making it clear why ontological structures and reasoning are advantageous over knowledge graphs.

Here are a few comments:
- NLP techniques are mentioned but not described. This is the case with entity linking and named entity recognition. What algorithms are used? This detail is important because this model is central to the system’s architecture.
- It would be interesting to walk through a complete example of a conversation, including the different steps and content (EL results, prompts, answers, etc.).
- The response appears to be evaluated against the gold standard answer, which experts manually wrote. How is this comparison carried out? Manually? Automatically based on similarity? This process should be described in detail.
- Figure 3.b is not displaying correctly.

Despite the value of demonstrating the usefulness of ontologies compared to knowledge graphs, I have concerns about the scalability of the techniques presented.
- I have concerns about the use of incorrect responses. Why not model the domain directly rather than using annotated responses? Is it really feasible to manually construct structured representations rather than using the annotated responses directly within the context of an LLM?
- Why verbalize ontologies manually rather than automatically? With a large-scale ontology, this seems complicated.
- The test dataset is quite small; it would be interesting to expand it to validate the results.
- Why use only entity detection and not the generation of SPARQL queries to be executed on the graph? The results of these queries could then be used to fill the prompt.

Review #3
By ISMAILOVA Nigar submitted on 04/Apr/2026
Suggestion:
Minor Revision
Review Comment:

The paper presents an interesting and relevant study on ontology-driven context engineering for semantically aware chatbot responses. The topic is timely and the work addresses an important challenge. The manuscript is generally well structured. At the same time, several issues should be addressed to strengthen the paper:
- One important limitation of the proposed approach is that its effectiveness depends heavily on the availability and involvement of domain experts for ontology construction, enrichment, and validation. This dependence can limit scalability and real-world applicability.
- The paper states that "particularly the added axiomatic richness of ontologies - has yet to be systematically explored", which is an important observation. However, it remains unclear how the proposed approach actually addresses or contributes to solving this issue.
- There is a minor readability issue in the technical flow of the manuscript: the notation for facts in the form {h, r, t} appears before it is properly introduced and explained in the following section.
- There are several inconsistencies in abbreviation usage throughout the paper: KG is defined multiple times in parentheses, although it should normally be introduced only once, and ECE appears in its full expanded form even after the abbreviation has already been defined. A more consistent use of terminology and abbreviations would improve the manuscript's readability and professionalism.

Additionally, the data file provided by the authors is well organized and sufficiently clear for repeating the experiments.

Review #4
By Amira Mouakher submitted on 07/May/2026
Suggestion:
Major Revision
Review Comment:

(1) Originality
The work presents a moderate level of originality, with some novel aspects but largely building on existing approaches.
(2) Significance of the results
The results are meaningful and show clear improvements, though their broader impact could be further demonstrated.
(3) Quality of writing
The paper is generally well written, but some sections would benefit from clearer explanations and improved structure.
(A) Data organization and README
A data package is provided, including a README that describes the main resources and their structure. The organization is generally clear, with separate files for responses, ontology, entity linking outputs, and code. However, the documentation remains somewhat high-level and could benefit from more step-by-step instructions for usage.
(B) Completeness for replication
The provided resources are valuable but not fully sufficient for complete replication. Several steps appear to require manual intervention (e.g., entity selection, reference entity lists, Deep algorithm inputs), and the end-to-end pipeline is not fully specified. Key implementation details (e.g., exact execution workflow, parameter settings, automation of experiments) are still missing.
(C) Repository choice
The materials are provided as supplementary files, but no persistent public repository is clearly indicated. Hosting the resources on a well-established platform such as GitHub, Zenodo, or Figshare would improve accessibility.
(4) Completeness of data artifacts
The artifacts include several important components (ontology in JSON, entity linking outputs, response evaluations, and code for light algorithms), which is a strong point. However, they do not yet constitute a fully reproducible package, as some elements of the pipeline (e.g., Deep Semantics implementation, automation scripts, and full evaluation protocol) are either missing or insufficiently documented.

--------------------------------------------------------------------------------------------------------------

The paper presents an interesting and relevant idea. The use of OntoUML/UFO foundational ontology patterns for semantic enrichment is original and goes beyond the usual graph-based RAG approaches. I think the distinction between “light semantics” and “deep semantics” is clear and the motivation is good, especially for high-stakes customer support scenarios where hallucinations and omissions are important problems.
The results are promising, especially the improvement in correctness reported for the Deep Semantics approach. However, the evaluation is still limited and some claims feel stronger than what the experiments can fully support. The final dataset is very small (13 conversations after filtering), there is no statistical significance analysis, and the correctness evaluation process is not explained clearly enough. The lack of prompt-length control is also an issue because compact prompts alone may already improve performance.
Some important technical details are missing or under-specified. The Deep Semantics algorithm is mostly described informally and would benefit from pseudo-code or a more formal definition. The entity linking process is also not explained in enough detail (models, thresholds, ambiguity handling, etc.). The verbalization process is central to the approach but still unclear in the current version.
I also think the related work section should discuss more recent ontology-grounded and logic-aware RAG approaches. The comparison with stronger text-centric RAG baselines is missing.
The paper is generally understandable, but there are still typos, formatting artifacts, and some figures/tables are difficult to interpret from the current version.
- Please provide more details about the entity linking setup, including models, thresholds, similarity metrics, and ambiguity handling.
- How exactly were correctness judgments performed? Were evaluators blind to the conditions? Was inter-annotator agreement measured?
- How is ontology verbalization implemented? Are templates manual or automatic?
- Did the authors control for prompt length across methods?
- Can the authors include stronger baselines, especially modern RAG pipelines with dense retrieval and reranking?
- Can the authors provide ablation studies showing the contribution of specific ontology patterns (RELATOR, ROLE, PHASE, EVENT, etc.)?
- How does the method behave with noisy entity linking outputs or unseen problems?
- What was the engineering effort required to build and maintain the ontology?
- Could the authors clarify where the replication resources are available (e.g., repository link)?
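
On the inter-annotator agreement question above, one common statistic is Cohen's kappa for two evaluators. The sketch below is only an illustration of how it could be reported: the function name and the binary correct/incorrect labels are assumptions, and the label values are made up (only the count of 13 items mirrors the dataset size mentioned above).

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa = (p_o - p_e) / (1 - p_e) for two annotators' labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label by both annotators.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the annotators' marginal label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


# Made-up judgments for 13 responses (1 = correct, 0 = incorrect).
annotator_a = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
annotator_b = [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1]
kappa = cohen_kappa(annotator_a, annotator_b)
print(f"Cohen's kappa: {kappa:.3f}")
```

Reporting such a value alongside the correctness scores, together with whether evaluators were blind to the condition, would make the evaluation protocol easier to assess.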