Discovering alignment relations with Graph Convolutional Networks: a biomedical case study

Tracking #: 2849-4063

Pierre Monnin
Chedy Raïssi
Amedeo Napoli
Adrien Coulet

Responsible editor: Guest Editors DeepL4KGs 2021

Submission type: Full Paper

Knowledge graphs are freely aggregated, published, and edited in the Web of data, and thus may overlap. Hence, a key task resides in aligning (or matching) their content. This task encompasses the identification, within an aggregated knowledge graph, of nodes that are equivalent, more specific, or weakly related. In this article, we propose to match nodes within a knowledge graph by (i) learning node embeddings with Graph Convolutional Networks such that similar nodes have low distances in the embedding space, and (ii) clustering nodes based on their embeddings, in order to suggest alignment relations between nodes of the same cluster. We conducted experiments with this approach on the real-world application of aligning knowledge in the field of pharmacogenomics, which motivated our study. We particularly investigated the interplay between domain knowledge and GCN models with the following two focuses. First, we applied inference rules associated with domain knowledge, independently or combined, before learning node embeddings, and we measured the improvements in matching results. Second, while our GCN model is agnostic to the exact alignment relations (e.g., equivalence, weak similarity), we observed that distances in the embedding space are coherent with the "strength" of these different relations (e.g., smaller distances for equivalences), which lets us consider clustering and distances in the embedding space as a means to suggest alignment relations in our case study.

Solicited Reviews:
Review #1
By Matthias Samwald submitted on 11/Aug/2021
Review Comment:

The authors sufficiently addressed my suggestions for improvement.

Review #2
Anonymous submitted on 20/Sep/2021
Review Comment:

I appreciate the work done for improving the paper. I still think that the novelty could be better emphasized but the work is interesting and well done.

Review #3
By Ernesto Jimenez-Ruiz submitted on 24/Sep/2021
Minor Revision
Review Comment:

I would like to thank the authors for their response letter and clarifications. I believe they have covered the comments of all three reviewers. The scope of the paper is clear now, which was one of the main negative points I highlighted in my original review. Now the paper can be reviewed from a different angle. The paper presents an interesting application of GCNs to learn KG embeddings in a specific setting.

The learned embeddings seem to be meaningful and consistent with the type of similarity.
The application domain is clear now, as is the need to align individuals within a single KG: new alignments need to be discovered because this KG integrates knowledge from different sources.

Additional comments:

Regarding Figure 2: does it mean that the “combination” of CYP2C9 + the drug warfarin causes the phenotype vascular_disorders? Depending on the logic of the causes property, this may not be correct. Is the property causes transitive? Or are specific rules taking care of this? If causes(CYP2C9, x) and causes(warfarin, x), then causes(x, vascular_disorders)?

I now understand the motivation for using a custom inference engine to have more control, although this could also be done by providing state-of-the-art inference engines with only the subset of rules to trigger. Creating a custom inference engine is valid, but one may also need to guarantee scalability, soundness, and completeness. Has this been taken into account?

As a supervised approach, the presented approach relies on previously defined gold clusters. For the presented approach, it is valid to assume that these exist; in practice, however, it may be convenient to automate the generation of these gold clusters, relying, for example, on (incomplete but) very precise alignments. We followed this idea, in a different setting, in [a]. It may be worth adding some discussion about how to automate the generation of gold clusters.

The paper in [12] provides interesting evidence that more is not always better for some embedding approaches; this is indeed useful for understanding how KGE approaches behave. Other works have also tried to inject the new inferences into the loss function instead of materializing them [b]. In OWL2Vec* [c], a system inspired by RDF2Vec, we showed the potential benefit of taking the ontology into account and of performing reasoning/materialization when learning the embeddings.

Minor comments:
- This initial of selection → This initial selection
- to handle a large number of clusters, potentially large → use synonym of large?...

Suggested literature:
[a] Jiaoyan Chen, Ernesto Jimenez-Ruiz, Ian Horrocks, Denvar Antonyrajah, Ali Hadian, Jaehun Lee. Augmenting Ontology Alignment by Semantic Embedding and Distant Supervision. European Semantic Web Conference, ESWC 2021.
[b] Claudia d'Amato, Nicola Flavio Quatraro, Nicola Fanizzi. Injecting Background Knowledge into Embedding Models for Predictive Tasks on Knowledge Graphs. ESWC 2021: 441-457.
[c] Jiaoyan Chen, Pan Hu, Ernesto Jimenez-Ruiz, Ole Magnus Holter, Denvar Antonyrajah, Ian Horrocks. OWL2Vec*: Embedding of OWL ontologies. Machine Learning, Springer, 2021.