Evaluating Ontologically-Aware Large Language Models: An Experiment in Sepsis Prediction

Tracking #: 3890-5104

Authors: 
Lucas Gomes Maddalena
Fernanda Araujo Baião
Tiago Prince Sales
Giancarlo Guizzardi

Responsible editor: 
Aldo Gangemi

Submission type: 
Full Paper
Abstract: 
Early and accurate detection of sepsis during hospitalization is critical, as it is a life-threatening condition with significant implications for patient outcomes. Electronic Health Records (EHRs) offer a wealth of information, including unstructured textual data, which often contains more nuanced insights than structured data alone. To process such textual data, a variety of Natural Language Processing (NLP) methods have been employed with limited effectiveness. Recent advancements in computational resources have led to the development of Large Language Models (LLMs), which can effectively process vast amounts of text to identify relationships and patterns between words and structure them into embeddings. This enables LLMs to extract meaningful insights within specific domains. Despite these advances, LLMs struggle to capture the real-world semantics of clinical texts, which are critical for understanding the complex interconnections among terms and ensuring terminological precision. This work presents a case study using Clinical KB BERT, an approach for embedding clinical notes of ICU patients that incorporates semantic information from the Unified Medical Language System (UMLS) ontology. By integrating domain-specific knowledge from UMLS, Clinical KB BERT aims to improve the semantic understanding of clinical data, thus enhancing the predictive performance of the resulting models. The present study compares Clinical KB BERT against Clinical BERT, a widely used model in the healthcare domain. The experimental results demonstrate that semantically enriched embeddings produced a more accurate and less uncertain model for the early prediction of sepsis. Specifically, they increased the Area Under the Receiver Operating Characteristic Curve (AUC-ROC) from 0.826 to 0.853, while the mean predictive entropy over the entire test dataset decreased from 0.159 to 0.142. Furthermore, the reduction in mean predictive entropy was even more pronounced in cases where both models made correct predictions, decreasing from 0.148 to 0.129. Notably, the practical impact of these improvements includes a substantial decrease in the number of false negatives (from 162 to 128, out of 227 septic cases), underscoring the ability of the semantically aware model to reduce missed early diagnoses and improve patient outcomes.
Tags: 
Reviewed

Decision/Status: 
Major Revision

Solicited Reviews:
Review #1
Anonymous submitted on 22/Sep/2025
Suggestion:
Major Revision
Review Comment:

In the paper titled “Evaluating Ontologically-Aware Large Language Models: An Experiment in Sepsis Prediction”, the authors evaluate Clinical KB BERT, an existing knowledge-based extension of Clinical BERT. It relies on the UMLS knowledge base, which is encoded as relation triplets and then injected into the input sequence of the transformer BERT model. Both models are applied to the task of early sepsis onset prediction in an ICU setting.

Overall, I feel that the paper addresses a timely and relevant topic, has a sound structure, and is a good fit for this journal. A particular strength of the manuscript is the detailed methodological description of the Clinical KB BERT model and its integration of the UMLS knowledge. Moreover, it makes a strong case for the application to sepsis prediction by highlighting both the importance of the task and the shortcomings of existing approaches.

Despite these strengths, I have the following main concerns regarding the current state of the manuscript:
(1) Novelty and state-of-the-art: 
The authors explicitly state that they do not intend to introduce a novel method or algorithm but instead consider the application and analysis of Clinical KB BERT within the sepsis prediction use-case as the main contribution and novelty of their work.
While I generally agree that such application- and evaluation-focused studies can offer valuable insights to the research community, I see two main shortcomings in this regard:
(a) Besides some background on sepsis prediction and the associated challenges, as well as the dataset and feature selection underlying the evaluation, the manuscript provides neither domain-specific insights during the analysis nor any clinical validation or demonstration. Instead, the analysis remains mostly technical and general, without discussing the implications of the findings for the domain of application.
(b) The choice of Clinical (KB) BERT as the model feels somewhat outdated and is not well motivated. Given recent advancements in state-of-the-art NLP, this choice of model considerably limits the impact of the study. While there would certainly be reasons for choosing a BERT-based model over recent LLMs (e.g., constraints on computational resources or privacy regulations requiring a self-hosted model), the authors do not properly motivate their choice. Even more surprisingly, there is no mention of more advanced LLMs such as GPT at all. In this regard, I suggest either justifying the choice of a BERT-based model in more detail or considering the use of state-of-the-art LLMs, such as GPT-based models.
Overall, the study therefore sits oddly between a technical paper and a case study: it lacks the technical novelty and state-of-the-art methods to qualify as the former, and the domain-specific analyses and findings to qualify as the latter.

(2) Empirical evaluation:
First of all, I want to emphasize that Section 5 discusses the empirical results in great detail and that the analyses provided are relevant to the use-case at hand. However, it would benefit from some improvements:
(a) The evaluation of different failure modes (Sections 5.4 to 5.8) is quite extensive but focuses largely on discussing the FPs, FNs, TPs, and TNs of the two model types. While I appreciate this effort, it becomes quite repetitive after a while, and its added value compared to the previous results in Tables 4, 5, and 6 is rather limited. I recommend shortening these sections and providing additional analyses, e.g., the effect of different prompting patterns used to inject the knowledge or the impact of different knowledge bases in general.
(b) The selection of baselines is limited to a single, semantically unaware BERT model. In order to properly assess the true value of Clinical KB BERT, I suggest additional baselines, possibly including some (non-)pretrained LLMs such as GPT or Llama.
(c) Also, as suggested in (1), the evaluation would benefit from use-case-specific analyses. In the case of sepsis prediction, it could, for example, be interesting to analyze which sub-diagnoses or types of sepsis benefit more or less from semantically aware models. This could also be supported by providing an exemplary patient case and highlighting how knowledge injection improved the patient outcome.

(3) Presentation: 
Although the paper is mostly well-organized and overall easy to follow while still methodologically detailed, some issues regarding the overall presentation persist.
(a) There are some serious consistency issues throughout the paper. First, abbreviations/acronyms are used very inconsistently: some are used before being introduced, some are introduced several times, and some are introduced but never used.
Moreover, some terminology is imprecise or inconsistent; for example, the authors switch between AUC-ROC and ROC-AUC throughout the manuscript.
(b) Some elements of Figure 1 are difficult/impossible to read and would benefit from both increased size and resolution.
(c) The structure of the Results and Discussion section could be improved: Section 5.3 only introduces the different evaluation scenarios that are analyzed in detail in Sections 5.4 to 5.8. Although it makes sense to introduce all scenarios before providing details on each of them, I don't feel that Section 5.3 should be at the same section level as the following sections, as it does not contribute to the actual discussion of the empirical results.
(d) Some typos and difficult sentences could be improved further.
(e) Several citations have broken or missing references.
(f) The outlines of the red boxplots in Figures 5 and 7 are quite difficult to see, as the boxplots are located within the violin plots; for example, the median values are hard to discern.
Given the sheer frequency of these errors, the manuscript would benefit from additional proof-reading.

Additionally, some minor remarks:
- In my opinion, referring to BERT or its variants as LLMs is somewhat imprecise. Given both the number of parameters and the encoder-only, transformer-based architecture of BERT, it does not qualify as an LLM by today's standards. In fact, the authors use the more precise term masked language model (MLM) once in the paper but do not stick with it. While I understand the timeliness of the term LLM, using it here seems imprecise and raises false expectations.
- Table 3 is supposed to show the results of 8-fold cross-validation, when in fact results for 9 folds (0 to 8) are reported. Please clarify this discrepancy. The table would also benefit from a row listing the mean over all folds, possibly including the standard deviation.
- In the introduction alone, three different types of dashes are used for the same purpose (see page 2, lines 30, 34, and 35). I suggest standardizing this.

Review #2
Anonymous submitted on 18/Dec/2025
Suggestion:
Minor Revision
Review Comment:

The paper investigates whether incorporating structured ontological knowledge into language models improves early sepsis prediction from ICU electronic health records. Specifically, it compares Clinical BERT with Clinical KB BERT, a semantically enriched variant that integrates the UMLS ontology during pretraining.

Using the MIMIC-III dataset, the authors embed clinical notes and combine them with structured physiological data in a GRU-based temporal model to predict sepsis onset within a 4-hour window. Performance is evaluated using ROC-AUC, MCC, recall, calibration (GiViTI calibration belt), and predictive entropy to assess both accuracy and uncertainty.
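As a quick reference for the uncertainty metric used throughout the paper, the binary predictive entropy of a predicted probability p is H(p) = -(p ln p + (1 - p) ln(1 - p)). A minimal sketch follows; the function name and example probabilities are illustrative assumptions, not taken from the paper:

```python
import numpy as np

def predictive_entropy(p, eps=1e-12):
    """Binary predictive entropy H(p) = -(p ln p + (1 - p) ln(1 - p)),
    computed with natural logarithms and clipped away from 0/1 for stability."""
    p = np.clip(np.asarray(p, dtype=float), eps, 1.0 - eps)
    return -(p * np.log(p) + (1.0 - p) * np.log(1.0 - p))

# Mean entropy over a hypothetical set of predicted sepsis probabilities
probs = np.array([0.05, 0.50, 0.92, 0.30])
mean_entropy = predictive_entropy(probs).mean()
```

With natural logarithms the maximum is ln 2 ≈ 0.693 (at p = 0.5), so the mean entropies of 0.142–0.159 reported in the abstract correspond to fairly confident predictions on average.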

The results show that the ontologically enriched model achieves higher ROC-AUC, better MCC, improved recall, fewer false negatives, and lower predictive entropy, suggesting more reliable and clinically useful predictions. The paper also provides a detailed stratified analysis of cases where semantic enrichment helps or fails.

Summary of Strengths:

Conceptual contribution: a clear and well-motivated focus on semantic grounding via ontologies, addressing a known limitation of clinical NLP.

Strong alignment with Semantic Web and ontology-driven AI themes.

Methodological rigor: careful experimental design with 8-fold cross-validation, hold-out testing, and multiple complementary metrics.

Appropriate use of MCC and recall given the strong class imbalance and clinical context.

Inclusion of calibration analysis and predictive entropy goes beyond standard performance reporting and strengthens the reliability argument.

Empirical results: consistent improvement of Clinical KB BERT over Clinical BERT in clinically critical dimensions (recall, false negatives, uncertainty).

Quantitative gains are modest but meaningful in a high-risk medical setting.

Stratified analysis of prediction cases provides insight into where and why semantic enrichment helps.

The link between ontology integration, uncertainty reduction, and clinical impact is clearly articulated.
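The reviewer's point that MCC is appropriate under strong class imbalance can be illustrated with a small sketch; the confusion-matrix counts below are hypothetical, not taken from the paper:

```python
import math

def mcc(tp, fp, fn, tn):
    """Matthews correlation coefficient from confusion-matrix counts.
    Returns 0.0 when any marginal is empty (the conventional fallback)."""
    num = tp * tn - fp * fn
    den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return num / den if den else 0.0

# Hypothetical imbalanced cohort: 50 septic patients out of 1000.
# A trivial "no sepsis for everyone" classifier reaches 95% accuracy,
# yet its MCC is 0, exposing it as uninformative.
trivial = mcc(tp=0, fp=0, fn=50, tn=950)
```

Unlike accuracy, MCC only rewards models that perform well on both the majority and minority classes, which is why it is a sensible headline metric for a rare-event task like sepsis onset.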

Summary of Weak Points:

My main concern is the use of MIMIC-III for evaluation when both models (Clinical BERT and Clinical KB BERT) have been fine-tuned on it.

Novelty limitations: the work is largely an application and evaluation of an existing model (Clinical KB BERT) rather than a new modeling contribution.

Ontology integration itself is not novel; the main novelty lies in the sepsis prediction use case.

The manuscript is overly long and repetitive in several sections, especially in the Results and entropy analyses.

Some paragraphs restate the same conclusions multiple times with minimal added insight.

Minor grammatical issues, awkward phrasing, and occasional inconsistencies reduce polish.

Related work and citations

Several placeholder references remain (e.g., "author?" or "[?]") and need to be fixed (9, 14; related to citations [25] and [33]).

Although uncertainty is well analyzed, there is limited qualitative or clinical interpretation of which semantic relations or ontology components contribute most to improvements.

Suggestions for Improvements and Corrections:

It is very important that the authors clarify whether the use of MIMIC-III as the evaluation dataset is justified and ensure that there is no risk of data leakage.

Tighten the manuscript

Reduce redundancy in Sections 5.3–5.8 by consolidating entropy analyses.

Fix citation and reference issues: resolve all placeholder references (“author?”, “[?]”) and ensure consistent citation formatting.

Clarify novelty and contribution: more explicitly position the paper as an evaluation of ontological enrichment in a high-stakes clinical task, rather than implying a new modeling technique.

Highlight what insights this study provides beyond raw performance gains (e.g., reliability, safety).

Ensure consistent terminology (e.g., “semantically aware/unaware”) throughout.

Conclusion:
The paper is a rather solid, well-motivated work with clear relevance to both clinical AI and semantic technologies. Its main weaknesses lie in the limited novelty at the modeling level and in the unassessed risk of data leakage arising from the use of the same dataset (MIMIC-III) for both training and evaluation. If these points are clarified and the writing is improved, it would be a suitable contribution for a specialized journal such as the Semantic Web Journal.