Interpretable Ontology Extension in Chemistry

Tracking #: 3052-4266

Martin Glauer
Adel Memariani
Fabian Neuhaus
Till Mossakowski
Janna Hastings

Responsible editor: 
Guest Editors Ontologies in XAI

Submission type: 
Full Paper
Abstract:
Reference ontologies provide a shared vocabulary and knowledge resource for their domain. Manual construction and annotation enables them to maintain high quality, allowing them to be widely accepted across their community. However, the manual ontology development process does not scale for large domains. We present a new methodology for automatic ontology extension for domains in which the ontology classes have associated graph-structured annotations, and apply it to the ChEBI ontology, a prominent reference ontology for life sciences chemistry. We train Transformer-based deep learning models on the leaf node structures from the ChEBI ontology and the classes to which they belong. The models are then able to automatically classify previously unseen chemical structures, resulting in automated ontology extension. The proposed models achieved an overall F1 score of 0.80, an improvement of 6 percentage points over our previous results on the same dataset. In addition, the models are interpretable: we illustrate that visualizing the model's attention weights can help to explain the results by providing insight into how the model made its decisions. We also analyse the performance for molecules that have not been part of the ontology and evaluate the logical correctness of the resulting extension.

Minor Revision

Solicited Reviews:
Review #1
Anonymous submitted on 14/Apr/2022
Minor Revision
Review Comment:

This submission presents a new approach to automatically extending the ChEBI ontology via deep learning models.

Overall, this is a very well-written paper that is, I think, a good contribution to the current state-of-the-art in the area. The approach presented by the authors has its drawbacks and limits, which the authors themselves point out while nonetheless making a solid case for the value and potential of their approach.

Therefore, I think that this contribution can be accepted.

Some minor comments follow:

- Page 7, line 37. just before Section 3.2: "we created a second dataset without restricting to a maximum of 100 members..." I think that here it was intended "... to a *minimum* of 100 members"? Above (line 28 of the same page) we are requiring at least 100 individuals per class.

- Page 11, line 4: "Yet, the model show overall good performance". This seems to be very much class-dependent. For some classes, this model clearly does not seem to give very useful predictions at all, while for others it seems to be pretty reliable. It would be interesting to look further into which kinds of classes this approach works for and which ones it doesn't (the authors mention structural features such as cycles as potentially problematic, for example).

- Page 12: "Figure 10 illustrates a selection of the attnetion weighs..." if I'm not mistaken, both Figure 9 and Figure 10 do that, and the following discussion is about both. So I would say "Figures 9 and 10 illustrate..."

- Page 16, line 1: "The system extends the given ontology using the same ontology language that has been used to build it". This is true, but it does not use all the features of the language that are used by the original ontology: for example, the original ontology makes use of disjointness axioms, but the system cannot introduce novel disjointness axioms and (if I'm not misunderstanding it) merely assigns new molecules to given classes. Despite the approach of the authors being, I think, interesting and valuable, this is a limitation that I think is worth remarking on.

Review #2
By Uli Sattler submitted on 03/May/2022
Minor Revision
Review Comment:

In this paper, the authors describe an approach (in 2 variants) to extend a well-established chemistry ontology, Chebi, with new classes (more precisely, suggestions for new, atomic ’SubClassOf’ axioms) for unseen chemicals from their SMILES strings. The approach uses transformers/deep learning models and domain-specific embeddings of the SMILES strings and is trained on the existing Chebi ontology (where the SMILES strings are captured as annotations of classes). The results are promising: on existing Chebi classes, they achieve good F1 scores and outperform the authors’ previous approach, and on those not yet covered in Chebi the results look good, but exact evaluation is part of future work. The new approach has the potential to be transparent: considering areas of attention of the model on (positive?) classifications can indicate reasons for the classifications.

The paper reads well and the results are interesting, though I think the presentation can and should be made more clear by addressing the following points. Also some of the claims about this new approach should be made a little more carefully to fit the evidence gathered so far:

- Explanations of diagrams and plots need to be clearer: all axes and colours used need to be explained, ideally in the caption (eg Fig 7, it’s unclear what the x-axis is and what the colours mean; perhaps the ‘blue/red’ of Fig 7(c) is also used for (a) and (b), but this can be made clearer. Eg Fig 8, what is enumerated on the x-axis? This becomes sort of clear in the text but should also be clear from the caption). Also, please make sure that numbers on axes are readable (even with quite a lot of zooming in, this isn’t the case in Figure 2 - and it’s also lacking labels)

- Some of the claims made are not strongly supported by the evidence provided in the paper: the interpretability/explainability is discussed by an interesting example, but a suitable evaluation is left for future work. Furthermore, it seems that explanations will only be available for positive classification: what would one do for false negatives? Similarly, the current approach addresses ontology learning in a very weak form as it is restricted to learning of atomic subclass-relationships. While the results are interesting, one could also call this ‘class localisation’ or ‘class insertion’.

- Throughout the paper, the nature of the (structured) annotation used should be made more clear: it took me a while to realise that the SMILES strings were used (and without further statements around them) since the annotations used are described in quite a few different ways first.

More detailed comments and suggestions:

Page 2
- line 5: rephrase ‘Cheb tries to’ to ‘Chebi engineers try to ‘ or such like
- Line 12: is there a reference for Chebi’s workflow? Also "navigating the ontology scaling dilemma “: is ‘navigating' really what you mean here?
- Line 13: I don’t understand “ design decisions [..] analogously to new classes and relations, “

Page 3:
- perhaps also explain what the *target* of these ontology extension approaches are (are they all aimed at atomic SubClassOf axioms?)
- Would the following be clearer? "Given the *documented, structured* design decisions by the ontology developers, how would they extend their ontology to cover a novel entity? “
- Line 39: "within these structures *whose* sub-graphs may themselves”?
- Perhaps move the explanation of SMILES to an earlier point, eg a small section on ‘background on Chebi’?

Page 5
- line 16: can you be more precise on "and the system as a whole was not explainable.”
- Line 31: "based on the design decisions that are implicitly reflected in the structure of ChEBI. ” one of the places that confused me (see above): isn’t your approach rather focussed on the (structured) annotation documenting/reflecting these design decisions?

Page 6
- line 34: "One of these successful architectures is RoBERTa [44], *whose* architecture offers a learning paradigm “
- Line 33: here and later in the evaluation, it would be interesting to know this distribution of ‘several’ (direct) superclasses: how many classes have 1 superclass? How many 2?…
- "with a plausible real-life dataset of chemicals” isn’t it that the related use case is realistic (rather than the dataset ‘real-life’)?

Page 7: line 25 could you briefly sketch the algorithm used and/or explain the ‘class merging’ step (perhaps using an example)?

Section 4.1: given that we’re looking at multi-label classification could you please briefly explain precision/recall: do we need to get the whole label set correct to be correct or is this counted on a ‘per label’ basis’ (and perhaps drop the explanation on page 11 of the usual precision and recall)? Also, perhaps illustrate the different F1 scores using a small example?
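To make the reviewer's question concrete, here is a minimal sketch (not the authors' evaluation code; the chemical class names and label sets are invented) of counting on a per-label basis and the resulting micro- vs macro-averaged F1:

```python
# Illustrative sketch only (not the paper's evaluation code): per-label
# counting for multi-label classification, and the resulting micro- vs
# macro-averaged F1. The class names here are invented.

def f1(tp, fp, fn):
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

# gold and predicted label sets for three molecules
gold = [{"acid", "ester"}, {"acid"}, {"ketone"}]
pred = [{"acid"}, {"acid", "ketone"}, {"ketone"}]

labels = {"acid", "ester", "ketone"}
counts = {l: {"tp": 0, "fp": 0, "fn": 0} for l in labels}
for g, p in zip(gold, pred):
    for l in labels:
        if l in g and l in p:
            counts[l]["tp"] += 1  # correctly assigned label
        elif l in p:
            counts[l]["fp"] += 1  # predicted but not in the gold set
        elif l in g:
            counts[l]["fn"] += 1  # missed label

# micro-F1 pools all counts, so frequent labels dominate
tp = sum(c["tp"] for c in counts.values())
fp = sum(c["fp"] for c in counts.values())
fn = sum(c["fn"] for c in counts.values())
micro = f1(tp, fp, fn)

# macro-F1 averages per-label F1, so rare labels weigh equally
macro = sum(f1(**c) for c in counts.values()) / len(labels)

print(round(micro, 3), round(macro, 3))  # → 0.75 0.556
```

Under per-label counting, a molecule with only part of its label set correct still contributes its true positives, which is why the two averages diverge when rare labels are predicted poorly.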

Page 10: why is Table 2 and Figure 7 restricted to Electra - or - how are these for Roberta?

Page 11: briefly explain what a ‘smaller class’ is? Also, "some of the predicted subclass relationships can be determined to be correct according to ChEBI, while others are incorrect.” - is confusing: some of these *are* correct according to Chebi - and that’s how the whole evaluation works, right? This needs clarifying, in particular, what is the verdict if, for chemical X, the predicted direct superclass is Y but it should be Y’s sub- (or super-)class Z?

Page 13 line 47: how many are ‘several’ (see above - would a distribution be interesting?)?

Section 4.3: did you eye-ball the extended ontology? It seems that one could relatively easily pick some of the new classes and (ask some chemists to) check the classification (even manually this should be feasible for quite a good sample)?

Page 16 line 1: which part of the OWL DL expressivity is relevant for your approach? Isn’t it restricted to /focussed on/working with the (inferred) class hierarchy and treating the rest of the ontology as a black box?

Page 16 line 32: " Visualisations such as those in Figs. 9 and 10b can be used to explain decision made by the model, raise trust in the prediction system, a” this looks a bit like an over-statement to me.

Review #3
Anonymous submitted on 06/May/2022
Major Revision
Review Comment:

(1) originality

The paper is reasonably original. It details the application of transformer networks to an ontology extension task in the domain of chemical classification.

Transformer networks are well studied by now and the paper does not make a technical contribution at that level. So are the tricks around using attention for some form of "interpretation" of results. The paper is really an application paper with some smaller interesting choices such as tokenization.

Primarily, I think the paper is interesting because of the domain it addresses and the progress the model is making. I think that this could be interesting for ML practitioners in the biomedical field.

The part that is most questionable in my view is the discussion about interpretability/explainability. I'd suggest using the correct term here - which I think is "interpretability" - and de-emphasizing the contribution. It's not clear to me from the paper that the results really are interpretable in any meaningful sense. For that argument to hold, the authors would have to show additional data that this is indeed convincing to experts. Otherwise, it's just a story about interpretability. Within the context of the paper the attention analysis makes sense - as it shows the model has learned something about the domain. But for it to be truly about interpretability or even explainability requires additional experiments of the model with experts (which I didn't read in the paper).

(2) significance of the results

Results present a statistically significant increase over the state of the art.

The state of the art though hasn't seen much coverage by other attempts. It seems to be mostly a niche that the authors are occupying. That's obviously not necessarily a bad thing or something the authors can change - but it does limit the overall significance of results.

It would be good to see more data about the choices - i.e. ablation studies w.r.t. tokenization and other choices.

How were hyperparameters chosen?

(3) quality of writing

Generally the writing is good. It's relatively easy to understand and follow.

However, in certain cases the authors struggle to get to the point and explain why we need to understand a particular part of the system. I'd suggest working more with tables and also with subparagraphs to structure especially the technical sections. Separate the description of the current system from motivation. This is of course subjective - but I'd rather know quickly what the system is and then get a discussion/motivation of design choices. In part the paper reads like a story of what was tried - which is fine, but that clouds the details of what was actually used in the paper. There are different parts where I thought the paper could be significantly clearer.

a) The data part 3.1 is quite confusing. It might help to have a table clearly delineating the different datasets used in the experiment, the datasets constructed from these, and which model was trained with which - OR at least organize the section to clearly explain one after the other.

b) The whole tokenization part was very difficult to follow - not because the topic is difficult, but because we get a lot of information about word- vs character-level tokenization and then BPE, such that ultimately it was not very clear which was used. I suggest removing a large part of the discussion or making it very clear what was chosen and then justifying it.

c) The model part has a similar structure. I'd prefer to have a clear statement about what the models are, then a paragraph on each model, then the motivation for the different choices.
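The word- vs character-level distinction raised in (b) can be made concrete; an illustrative sketch (the regex is a generic atom-aware SMILES pattern, not necessarily the tokenizer used in the paper, and the molecule is arbitrary):

```python
# Illustrative sketch only: two ways a SMILES string could be tokenized.
# The regex is a generic atom-aware pattern, not necessarily the one
# used in the paper; the molecule is arbitrary.
import re

smiles = "CC(Cl)Br"

# character-level: every character becomes a token, so the two-letter
# element symbols Cl and Br are split apart
char_tokens = list(smiles)  # ['C', 'C', '(', 'C', 'l', ')', 'B', 'r']

# atom-aware: multi-character symbols and bracket atoms stay whole
pattern = r"\[[^\]]+\]|Br|Cl|%\d{2}|[A-Za-z]|\d|=|#|\(|\)|\+|-|/|\\|@|\."
atom_tokens = re.findall(pattern, smiles)  # ['C', 'C', '(', 'Cl', ')', 'Br']

print(char_tokens)
print(atom_tokens)
```

Byte-pair encoding would instead learn frequent merges from the training corpus (e.g. re-joining 'C' and 'l' into 'Cl') rather than relying on a hand-written pattern, which is part of what makes the choice worth stating explicitly.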

(4) Other comments

Almost all figures have very small legends, and many are missing clear x- and y-axis labels.

Section on related work: the last paragraph of "text-based" seems to really be better placed in the subsequent section.

I don't understand why Roberta is introduced if there are no results discussed

Since you really only have one baseline it might make sense to have that model available in an Appendix or at least offer a short description of the differences - e.g. tokenization if any

Figure 7 (a,b) what's x axis? I assume it's the classes but how did you order them?
Figure 7 (c) to me the color doesn't look red but orange
Figure 8 is confusing. what's on the x axis?
Figure 11 right. I am not sure a scatter plot without any ordering is really informative

Section 5 - unsupervised: In ML terms you are not doing unsupervised learning - even if labels come from the ontology.
Section 5 - interpretability: see comments above. In this section it seems you are saying that indeed this doesn't actually work in terms of interpretability.

Section Explainability - please rename (see comment about explainability/interpretability)

Assessment of “Long-term stable URL for resources”

(A) Data file is well organized and in particular contains a README file which makes it easy for you to assess the data

(B) The provided resources appear to be complete for replication of experiments

(C) The chosen repository is Zenodo and appropriate for long-term repository discoverability

(D) Data artifacts seem complete