Review Comment:
The paper investigates whether incorporating structured ontological knowledge into language models improves early sepsis prediction from ICU electronic health records. Specifically, it compares Clinical BERT with Clinical KB BERT, a semantically enriched variant that integrates the UMLS ontology during pretraining.
Using the MIMIC-III dataset, the authors embed clinical notes and combine them with structured physiological data in a GRU-based temporal model to predict sepsis onset within a 4-hour window. Performance is evaluated using ROC-AUC, MCC, recall, calibration (GiViTI calibration belt), and predictive entropy to assess both accuracy and uncertainty.
The results show that the ontologically enriched model achieves higher ROC-AUC, better MCC, improved recall, fewer false negatives, and lower predictive entropy, suggesting more reliable and clinically useful predictions. The paper also provides a detailed stratified analysis of cases where semantic enrichment helps or fails.
Summary of Strengths:
Conceptual contribution: a clear and well-motivated focus on semantic grounding via ontologies, addressing a known limitation of clinical NLP.
Strong alignment with Semantic Web and ontology-driven AI themes.
Methodological rigor: careful experimental design with 8-fold cross-validation, hold-out testing, and multiple complementary metrics.
Appropriate use of MCC and recall given the strong class imbalance and clinical context.
Inclusion of calibration analysis and predictive entropy goes beyond standard performance reporting and strengthens the reliability argument.
Empirical results: consistent improvement of Clinical KB BERT over Clinical BERT in clinically critical dimensions (recall, false negatives, uncertainty).
Quantitative gains are modest but meaningful in a high-risk medical setting.
Stratified analysis of prediction cases provides insight into where and why semantic enrichment helps.
The link between ontology integration, uncertainty reduction, and clinical impact is clearly articulated.
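To illustrate why MCC, recall, and predictive entropy are well chosen here, the following minimal sketch uses made-up labels, not the paper's data. It shows how a trivial classifier on an imbalanced set scores high accuracy but zero MCC and recall. The function names and the 95/5 class split are assumptions for illustration only, and the binary entropy shown is one common definition that may differ from the authors' exact formulation.

```python
import math

def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels in {0, 1}."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient; 0.0 when the denominator vanishes."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom

def recall(tp, fn):
    """Fraction of true positives that were detected."""
    return tp / (tp + fn) if (tp + fn) else 0.0

def binary_entropy(p):
    """Predictive entropy (in bits) of a binary prediction with P(positive) = p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

# Hypothetical imbalanced cohort: 95 non-septic, 5 septic stays.
y_true = [0] * 95 + [1] * 5
# Trivial classifier that never predicts sepsis.
y_pred = [0] * 100

tp, tn, fp, fn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)
trivial_mcc = mcc(tp, tn, fp, fn)
trivial_recall = recall(tp, fn)
```

In this toy case accuracy looks strong while MCC and recall expose that every septic case was missed, which is the failure mode that matters clinically. Likewise, `binary_entropy` is largest for predictions near 0.5 and shrinks as the model becomes more decisive.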
Summary of Weak Points:
My main concern is the use of MIMIC-III for evaluation when both models (Clinical BERT and Clinical KB BERT) have been fine-tuned on it.
Novelty limitations: the work is largely an application and evaluation of an existing model (Clinical KB BERT) rather than a new modeling contribution.
Ontology integration itself is not novel; the main novelty lies in the sepsis prediction use case.
The manuscript is overly long and repetitive in several sections, especially in the Results and entropy analyses.
Some paragraphs restate the same conclusions multiple times with minimal added insight.
Minor grammatical issues, awkward phrasing, and occasional inconsistencies reduce polish.
Related work and citations: several placeholders remain (e.g., “author?” or “[?]”) and need to be fixed (9, 14, related to citations [25] and [33]).
Although uncertainty is well analyzed, there is limited qualitative or clinical interpretation of which semantic relations or ontology components contribute most to improvements.
Suggestions for Improvements and Corrections:
It is very important that the authors clarify whether the use of MIMIC-III as the evaluation dataset is justified and ensure that there is no risk of data leakage.
Tighten the manuscript: reduce redundancy in Sections 5.3–5.8 by consolidating the entropy analyses.
Fix citation and reference issues: resolve all placeholder references (“author?”, “[?]”) and ensure consistent citation formatting.
Clarify novelty and contribution: more explicitly position the paper as an evaluation of ontological enrichment in a high-stakes clinical task, rather than implying a new modeling technique.
Highlight what insights this study provides beyond raw performance gains (e.g., reliability, safety).
Ensure consistent terminology (e.g., “semantically aware/unaware”) throughout.
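To make the requested leakage check concrete, one possible sketch is given below. It assumes that patient-level identifiers (for instance MIMIC-III's SUBJECT_ID) are available both for the notes used to train the language models and for the held-out evaluation set. Whether that information can be recovered for the released checkpoints is itself an assumption. The function names and the example IDs are hypothetical.

```python
def leakage_report(train_ids, eval_ids):
    """Summarize patient overlap between language-model training data and the evaluation set."""
    train, evaluation = set(train_ids), set(eval_ids)
    shared = train & evaluation
    fraction = len(shared) / len(evaluation) if evaluation else 0.0
    return {"shared": sorted(shared), "fraction_of_eval": fraction}

def leakage_free_subset(train_ids, eval_ids):
    """Keep only evaluation patients never seen during language-model training."""
    train = set(train_ids)
    return [pid for pid in eval_ids if pid not in train]

# Hypothetical patient identifiers, for illustration only.
lm_training_patients = [101, 102, 103, 104]
evaluation_patients = [103, 104, 105, 106]

report = leakage_report(lm_training_patients, evaluation_patients)
clean_eval = leakage_free_subset(lm_training_patients, evaluation_patients)
```

Reporting the overlap fraction, and ideally the metrics recomputed on the leakage-free subset, would allow readers to judge how much of the observed gain could be attributed to exposure to the evaluation patients' notes.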
Conclusion:
The paper is a rather solid, well-motivated work with clear relevance to both clinical AI and semantic technologies. Its main weaknesses lie in the limited novelty at the modeling level and the unassessed risk of data leakage from the use of the same dataset (MIMIC-III) for both training and evaluation. If these points are clarified and the writing is improved, it would be a suitable contribution for a specialized journal like the Semantic Web Journal.