OntoMatcher: Leveraging Context-Aware Siamese Networks, LLMs and BioBERT for Enhanced Biomedical Ontology Alignment

Tracking #: 3477-4691

Zakaria Hamane
Abdelhadi Fennan
Amina Samih

Responsible editor: 
Jérôme Euzenat

Submission type: 
Full Paper
Biomedical ontologies play a crucial role in knowledge representation and standardization within the biomedical domain. With the rapid growth of ontologies, the need for efficient and accurate alignment techniques has become paramount to ensure interoperability between various biomedical systems. Current ontology alignment methods often struggle to cope with the complex and dynamic nature of biomedical terminologies, resulting in suboptimal performance. In this study, we introduce a novel supervised deep learning approach for aligning biomedical ontologies, employing Large Language Models (LLMs) alongside Bidirectional Encoder Representations from Transformers for Biomedical Text Mining (BioBERT), a one-dimensional convolutional neural network (1D-CNN), highway networks, bi-directional long short-term memory (Bi-LSTM), and Siamese network models. This approach captures character-level and contextual information of entities and efficiently incorporates entity descriptions and context embeddings to improve alignment accuracy. Our method demonstrates a significant improvement in performance, achieving an F1 score of 0.87 for match/not-match classification and 0.94 for level classification, outperforming several baselines on benchmark datasets. These results indicate the potential of our approach, which employs LLMs for data enrichment and Transformer models for embeddings, in facilitating more effective alignment of biomedical ontologies. Ultimately, this enhances data integration and interoperability across different biomedical systems.
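The core comparison step described in the abstract — a shared encoder feeding both entities into a Siamese comparison — can be illustrated with a minimal NumPy sketch. The single dense `encode` layer here is an illustrative stand-in for the paper's BioBERT + 1D-CNN + highway + Bi-LSTM stack, and the feature choice (element-wise |difference| and product) is a common Siamese convention, not necessarily the authors' exact implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy weights for a shared (Siamese) encoder: both entities pass
# through the SAME parameters, so similar inputs map to nearby vectors.
W = rng.normal(scale=0.1, size=(16, 8))

def encode(x: np.ndarray) -> np.ndarray:
    """Shared encoder: one dense tanh layer (stand-in for the
    BioBERT + 1D-CNN + Bi-LSTM stack described in the abstract)."""
    return np.tanh(x @ W)

def match_features(e_s: np.ndarray, e_t: np.ndarray) -> np.ndarray:
    """Combine the twin embeddings into features for a downstream
    match/not-match classifier: element-wise |difference| and product."""
    u, v = encode(e_s), encode(e_t)
    return np.concatenate([np.abs(u - v), u * v])

# Two toy "entity" vectors (e.g. pooled token embeddings) for a
# source entity and a near-duplicate target entity.
e_s = rng.normal(size=16)
e_t = e_s + rng.normal(scale=0.01, size=16)

feats = match_features(e_s, e_t)
print(feats.shape)  # (16,): 8 |difference| features + 8 product features
```

For a near-duplicate pair like this, the |difference| half of the feature vector stays close to zero, which is the signal a match classifier (the paper uses XGBoost, per the reviews) would learn to exploit.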


Solicited Reviews:
Review #1
By Daniel Faria submitted on 18/Jul/2023
Major Revision
Review Comment:

The paper presents OntoMatcher, an AI-based Ontology Matching tool that combines BioBERT with character-level 1D Convolutional Neural Networks for encoding entity labels, and then combines Siamese Networks with XGBoost for computing similarity between entities.

First of all, I have to point out that there is already a well-established tool among the bio-ontologies community called "Ontomatcher" (https://github.com/chanzuckerberg/Ontomatcher) for matching text to ontology terms, so I would urge the authors to rename their tool to avoid confusion.

The paper is well-structured and clearly written, ranking high in terms of quality of writing.

It also ranks reasonably high in terms of originality. Harnessing Large Language Models is a "hot topic" among the ontology matching community, and although there are already a few published papers on the topic and one established tool (BERTMap), the proposed approach is sufficiently different to merit publication.

Where the work falls short is in terms of the significance of the results, as indeed there are no proper results. The experiments carried out by the authors and telegraphically described in the paper evaluate only the proposed tool, without comparing it to any baseline or competing tool. Moreover, they rely on an ad hoc benchmark that combines the test set of the UMLS data used by the authors with the OAEI SNOMED-NCI task (at least I assume that they aggregate the two, given that only a single result row per model of their tool is presented), making it impossible to compare the results with those of other tools. I’m also concerned about data circularity, given that the OAEI SNOMED-NCI task has a reference alignment derived from the UMLS, which the authors have used for training. Finally, I’m concerned by the fact that the authors only shared the source code in their GitHub repository, but did not share the training or test data or the trained AI models, meaning that the work is not reproducible.
As of 2022, the OAEI introduced benchmarks specifically for AI-based tools under the BioML track. This includes several benchmarks with training, testing and validation data. In order to properly evaluate their proposed tool, the authors must adopt these benchmarks. Moreover, they must compare their results with those of the top performing tools in the OAEI 2022. Even better would be for the authors to actually participate in the 2023 edition of the OAEI to ensure an impartial evaluation, but I acknowledge that the timing may not be opportune. Last but not least, the authors should present and discuss the results of such an evaluation in a less telegraphic manner.

Another aspect where the paper falls short is in presenting and discussing the related work. Namely, only two papers on the topic of AI-based ontology matching are mentioned, the latest of which is from 2020, whereas several more recent works have been published on the topic, including the tool BERTMap, which participated in the OAEI 2022’s BioML track. Moreover, at least the best-established non-AI tools for bio-ontology matching, LogMap and AgreementMakerLight, should also be referenced.

Review #2
Anonymous submitted on 19/Sep/2023
Review Comment:

This paper proposes OntoMatcher, an approach for the alignment of biomedical ontologies. The paper argues that the proposed approach considers the character-level and contextual information of entities and efficiently incorporates entity descriptions and context embeddings.

Their contributions are: using language models and a CNN architecture to better represent out-of-vocabulary terms; labeling parent-child relationships to be used for ontology alignment; and enriching ontologies with additional descriptions and context of entities using language models.

The paper, in its current format, is not well-structured, and the architectural choices are not justified. The paper contains many typos, and the figures are of low quality and not well-documented. The datasets used for the experiments are not clearly cited, and there is no benchmarking against previous work. Instead of explaining the code, or how a Siamese network or XGBoost works, the space should be used to motivate each section, specify what is novel, what problem it addresses, and how all the sections fit together. The paper is not easy to follow and is not self-contained. It lacks a broad analysis of the problem (related work and problem statement) and does not make a convincing case for the proposed approach or the architecture used.

Section 2:
The related work is not well-structured: it does not cover the existing literature, identify the limitations of each approach with regard to biomedical domain ontologies, or state how the proposed work overcomes those limitations. It also does not precisely specify how the different works in the literature differ from one another. Half of the space in this section is used to reiterate the proposed work of the paper, which does not align well with the title of the section.

Section 3:
It is not clear what the motivation behind this section is. I would suggest focusing only on the problem statement here and not referring the reader to different sections of the paper. Instead, add proper arguments for the overall approach that answer "why" rather than "how."

Section 4:
You should give an introduction to each subsection, specifying what will be elaborated in it and why; this is not clear when reading this section. The different labels are introduced more than once, making the text repetitive, yet never in a formal way ("We define 4 labels ..."). There are many typos in this section that make the definitions hard to follow. Procedures such as the parent-child relations in 4.2 should not be explained in such detail; rather, the algorithm should be outlined. No argument is given for the choices made in candidate selection and negative-example generation, although there are many alternatives: Why not choose random negatives? Why not embed the words in a space and choose the k closest candidates? Why were the listed features used? Regarding the complexity and runtime of the proposed approach, it would also be good to mention how long it takes and where the main overhead is, if any.

Section 5:
If 5.1 is a novel approach, this should be explicitly stated. Citations are missing for the justifications, as are experiments backing the arguments. The choice of training a Siamese network has not been justified, nor has the negative-example strategy mentioned earlier. Why an XGBoost classifier? Have you tried other classifiers, and did this one work best?

Section 6:
What are the exact datasets used? Where is the link to, or citation for, the datasets? It would be interesting to compare with OAEI results or other benchmark results.

Minor Suggestions:

In the abstract, mentioning the F1-score results as raw numbers is somewhat out of context, given that such numbers are only meaningful when compared against others and when the dataset is specified. Also mention in the abstract that the name of the proposed approach is OntoMatcher.
The titles chosen for sections 2, 3, and 4 are too long and not specific enough for section titles.
The figures are of low quality, and the procedure they depict is not clear. Try numbering and labeling the different parts of each figure (e.g., (a), (b), (c)) so the reader can follow the process as explained in the caption or text.

---- Some Minor remarks----

Inconsistencies in citation formats.
The formulas are not set in a proper math environment and hence do not render cleanly.

Section 1:
(3rd point of contribution, keep present tense): we leveraged --> We leverage
This approach enriched --> We show that this approach enriches..

Section 2:
(Typo) e.g. --> e.g.,

Section 3:
Fig.1 Ontologies O_{s} and O_{t} to be compatible with the text. The quality of the figure (esp. the ≡ sign) is not good.
Typo: Ot instead of O_{t}
fig2, title: OntoMatcher System Framework --> Overview of the approach (Overview of OntoMatcher)
fig2, Candidate Section depicted twice in the Figure
It can accept two ontologies --> It accepts two ontologies O_{s} and O_{t} as input
In scenarios where a neural network is utilized --> When a NN is used..
The use of § is not clear, replace with Section.

Section 4:
Typos make it very hard to follow the notation (e_{s}, e_{t}, ...). Keep the same notation throughout the paper and use \textit{} to differentiate names from the surrounding text.
Figure 4 below --> Figure 4
As schematized in Figure 5, First, --> As shown in Figure 5, first, ...

Section 5:
Figure 6: Typo: Architecture

stages --> steps