Capturing Concept Similarity with Knowledge Graphs

Tracking #: 3022-4236

Filip Ilievski
Kartik Shenoy
Nicholas Klein
Hans Chalupsky
Pedro Szekely

Responsible editor: 
Harald Sack

Submission type: 
Full Paper
Robust estimation of concept similarity is crucial for a range of AI applications, like deduplication, recommendation, and entity linking. Rich and diverse knowledge in large knowledge graphs like Wikidata can be exploited for this purpose. In this paper, we study a wide range of representative similarity methods for Wikidata, organized into three categories, and leverage additional knowledge as a self-supervision signal through retrofitting. We measure the impact of retrofitting with subsets from Wikidata and ProBase, scored based on language models. Experiments on three benchmarks reveal that pairing language models with rich information performs best, whereas the impact of retrofitting is most positive on methods that originally do not consider comprehensive information. The performance of retrofitting depends on the source of knowledge and the edge weighting function. Meanwhile, creating evaluation benchmarks for contextual similarity in Wikidata remains a key challenge.

Major Revision

Solicited Reviews:
Review #1
Anonymous submitted on 01/Apr/2022
Minor Revision
Review Comment:

This paper is concerned with the similarity of concepts in the context of knowledge graphs. The authors study a number of representative methods for measuring similarity and compare their performance, with and without additional information from retrofitting, on large KG datasets. The paper is very well written overall, with only a few minor oversights in spelling and grammar.

The motivation for solving the problem of similarity is clearly defined in Section 1. I appreciate the background that Section 2 provides in the context of concept similarity.
I wonder, though, if Section 2 and Section 6 could be combined into one section, so as to provide an overview and background of the problem in the context of previous work related to it? I think it would be beneficial to discuss related work before explaining the framework to put the rest of the paper into perspective.

My primary concern, and the questions that came up while reading the paper, relate to the precise problem statement of the work. I understand that the paper looks at concept similarity in knowledge graphs. Does this mean that entities are completely excluded from this examination? Later in the paper (Section 3.1), it is mentioned that entity abstracts from the DBpedia KG were considered. The terminology needs to be consistent and clear throughout the paper.

The authors have leveraged KG embeddings such as TransE and ComplEx as one of the measures for deriving similarity. These models generally do not include the concepts (the TBox) of the KGs during training and thus do not provide concept embeddings by default. Considering that this paper builds on their previous work, I would encourage the authors to specify clearly how the concept embeddings were obtained from the KG embeddings for the datasets under consideration.

In Section 3.1, the authors need to explain the rationale behind their choices: why were these particular properties chosen for lexicalization and not others? Why were these particular configurations of composite embeddings considered and not other combinations? The reasoning and motivation behind these decisions have not been explained, so the approach seems somewhat arbitrary.

Another major concern is the overall contribution of the paper. While the authors do not claim to have devised a new method for computing concept similarity, the takeaways from this paper are not well defined. Is the goal to study various existing methods and compare their performance?
Since the experiments did not reveal any one method to be far superior to the others, what key insights have the authors derived from this work that could benefit future work? For example, which method would be best for a particular new dataset, or what kind of new metrics would be useful? Perhaps the authors can rephrase their contributions and findings to be more precise.

A link to a GitHub repository hosting the datasets and code for the paper, including a README file with instructions, has been provided. The artifacts seem to be complete and sufficient for replicating the experiments.

Other comments

- Section 1 - The last line of the introduction seems out of place. Perhaps it can be moved to a later part of the paper, since at this point readers are not yet familiar with the details of the framework.

- Section 3 - The link to the GitHub repository is provided here; it should be moved to the end of the introduction section.

- Section 4 - It is mentioned that the annotation scores from 5 researchers were averaged. I wonder if this is a good approach to obtaining the final score. If researcher1 gave a score of 1 (near identity) and researcher2 gave a conflicting score of 4 (unrelated), the average score will not reflect this conflict at all. Would it be better to resolve such conflicts instead of averaging them? Perhaps the authors can explain why this is not necessary.

- Section 5.2, page 7, line 42 - the Absract -> the Abstract

- Section 6, page 9, line 5 - considered by he -> considered by the
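
The averaging concern raised for Section 4 above can be illustrated with a small sketch. The ratings below are hypothetical (they are not taken from the paper's annotation data) and assume the 1 (near identity) to 4 (unrelated) scale mentioned in the comment. The point is only that two pairs can share the same mean while differing sharply in annotator agreement, so reporting a spread measure alongside the mean would expose such conflicts.

```python
from statistics import mean, pstdev


def summarize(scores):
    """Return (mean, population std dev, range) for one pair's annotator scores."""
    return mean(scores), pstdev(scores), max(scores) - min(scores)


# Hypothetical ratings from 5 annotators, 1 = near identity, 4 = unrelated.
agreeing = [2, 3, 2, 3, 2]      # annotators broadly agree
conflicting = [1, 4, 1, 4, 2]   # annotators split between the extremes

for name, scores in [("agreeing", agreeing), ("conflicting", conflicting)]:
    avg, spread, score_range = summarize(scores)
    print(f"{name}: mean={avg:.2f} std={spread:.2f} range={score_range}")
```

Both pairs average to the same value, but the standard deviation and range are much larger for the conflicting pair, which is the information a plain average discards.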

Review #2
Anonymous submitted on 27/Apr/2022
Major Revision
Review Comment:

The paper presents a study of similarity in Wikidata and of the impact that retrofitting (subsequent training of embeddings to fit external information, in this case the similarity of entity pairs in Wikidata) can have on both KG-based and text-based embedding similarity.

This is a very relevant topic with several interesting applications in multiple domains.

I believe the paper would be much improved by addressing the following issues:

1. Related work

There is a very relevant paper in this area that is not covered in the related work:
Lastra-Díaz, Juan J., Josu Goikoetxea, Mohamed Ali Hadj Taieb, Ana García-Serrano, Mohamed Ben Aouicha, and Eneko Agirre. "A reproducible survey on word embeddings and ontology-based methods for word similarity: linear combinations outperform the state of the art." Engineering Applications of Artificial Intelligence 85 (2019): 645-665.

This is a fully reproducible paper, and producing results in the same conditions for the proposals presented in the manuscript would increase its value substantially.

Moreover, this work combines language models and KG embeddings, but does not discuss the related work on this combination:
Wang, Zhen, Jianwen Zhang, Jianlin Feng, and Zheng Chen. "Knowledge graph and text jointly embedding." In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pp. 1591-1601. 2014.
Xie, Ruobing, Zhiyuan Liu, Jia Jia, Huanbo Luan, and Maosong Sun. "Representation learning of knowledge graphs with entity descriptions." In Proceedings of the AAAI Conference on Artificial Intelligence, vol. 30, no. 1. 2016.
Peters, Matthew E., Mark Neumann, Robert L. Logan IV, Roy Schwartz, Vidur Joshi, Sameer Singh, and Noah A. Smith. "Knowledge enhanced contextual word representations." arXiv preprint arXiv:1909.04164 (2019).

2. Clear definitions
The concept of retrofitting is only presented very late in the text. It is not something most readers will be familiar with, and it represents an important aspect of the work. It should be defined in the introduction to help readers understand the goals. A definition of KG embeddings and text embeddings is also lacking. Although these are increasingly commonplace, a soft introduction to these terms would improve readability.

3. More focus on novel contributions
Retrofitting appears to be the more original aspect of the work. However, it is not described or analysed in depth. Results are only shown for cosine similarity of DistilRoberta embeddings.
How the pairs are built is not well described. I believe the word "edge" is used to mean a pair of concepts. The selected pairs are in all likelihood quite similar (parents and siblings), and the distribution of similarity for these pairs is not studied.

"We focus our experiments on cosine similarity as a weighting function, because we observed empirically that it consistently performs better or comparable to the other two weighting functions." This is a pity. A comparison of the weighting functions is exactly what I was hoping to find in the paper. In the end, I am unsure whether there is any real advantage in using Wikidata to measure conceptual similarity, or whether we are simply better off using language models alone.

The analysis of the quartiles is potentially quite interesting, but the results are not easy to read (there is no table), and the performance metric switches to F-measure without any explanation of how it is computed.

I strongly advise the authors to apply their retrofitting method on the same datasets and under the same conditions as Lastra-Díaz et al.

4. Justifications and clarification of methodological aspects

How TopSim is computed is not clear at all. Is this the measure proposed in 10.1109/ICDE.2012.109?

Why does Composite-6 not include labels and desc?

Why is DistilRoberta used for abstract, labels, label+description and BERT-base for lexicalization?

5. The large size of Wikidata is referred to multiple times, but the methods were applied to at most a few hundred entity pairs. What are the true implications of the large size of Wikidata for similarity estimation?

Minor: it would be better to employ the terminology defined by Lastra-Díaz et al. when categorizing the different similarity metrics.

In summary, there is an interesting idea in applying retrofitting to concept similarity with KGs. However, the paper does not consider related work appropriately, which limits the value of its contributions (see Lastra-Díaz et al.). It also does not give sufficient detail on the methods and choices, and could be much richer in terms of tested configurations, presented results, and discussion.

Review #3
Anonymous submitted on 29/Sep/2022
Major Revision
Review Comment:

In this work, the authors propose a framework for concept similarity in knowledge graphs, with a focus on Wikidata. They study the impact of different similarity methods, language models, and knowledge graph embeddings on concept similarity. A retrofitting technique is applied to the embeddings, which iteratively updates the node embeddings to bring them closer to those of their connections, using an external dataset. In this paper, the authors adapt retrofitting to tune embeddings from KGs and LMs to Wikidata and ProBase.
The framework has been evaluated on three different benchmarks: WD-WordSim353, WD-RG65, and WD-MC30.
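
For readers unfamiliar with retrofitting, the iterative update summarized above can be sketched as follows. This is a minimal illustration of a common retrofitting formulation (each node moves toward a weighted average of its original vector and its neighbours' current vectors), not the authors' implementation. The toy vectors, the edge weights, and the names `retrofit`, `alpha`, and `iterations` are assumptions made for illustration; in the paper's setting the edge weights would come from a weighting function such as cosine similarity.

```python
import numpy as np


def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def retrofit(embeddings, edges, iterations=10, alpha=1.0):
    """Pull each node toward its weighted neighbours while staying near its original vector.

    embeddings: dict mapping node -> vector
    edges: dict mapping node -> list of (neighbour, weight) pairs (hypothetical weights)
    alpha: how strongly a node is anchored to its original embedding
    """
    original = {k: np.asarray(v, dtype=float) for k, v in embeddings.items()}
    current = {k: v.copy() for k, v in original.items()}
    for _ in range(iterations):
        for node, neighbours in edges.items():
            usable = [(n, w) for n, w in neighbours if n in current and w > 0]
            if node not in current or not usable:
                continue
            total_weight = sum(w for _, w in usable)
            pulled = sum(w * current[n] for n, w in usable)
            # Weighted average of the original vector and the neighbours' vectors.
            current[node] = (alpha * original[node] + pulled) / (alpha + total_weight)
    return current


# Toy example: "cat" and "feline" are linked, "car" has no edges.
vectors = {
    "cat": np.array([1.0, 0.0]),
    "feline": np.array([0.0, 1.0]),
    "car": np.array([1.0, 1.0]),
}
edges = {"cat": [("feline", 1.0)], "feline": [("cat", 1.0)]}

fitted = retrofit(vectors, edges)
print("before:", cosine(vectors["cat"], vectors["feline"]))
print("after: ", cosine(fitted["cat"], fitted["feline"]))
```

Linked nodes end up closer together, while nodes without edges keep their original vectors. This is consistent with the abstract's observation that retrofitting helps most for methods that did not already use comprehensive information.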

1. The paper is well written and easy to follow, and different combinations of framework variants have been used in the evaluation against the three benchmarks.
2. The approach has been well explained using the diagram.
3. The code and data are made available by the authors.
However, I have the following concerns about the paper:

1. The paper has a brief description of the related work; however, some works on concept similarity have not been mentioned [1,2]. Also, the gap between current research and the proposed model has not been clearly pointed out.
[1] Alkhamees, M.A., Alnuem, M.A., Al-Saleem, S.M. and Al-Ssulami, A.M., 2021. A semantic metric for concepts similarity in knowledge graphs. Journal of Information Science, p.01655515211020580.
[2] Zhu, G. and Iglesias, C.A., 2016. Computing semantic similarity of concepts in knowledge graphs. IEEE Transactions on Knowledge and Data Engineering, 29(1), pp.72-85.
2. The results of the framework are presented for different variants, i.e., its features and different embedding models, on the benchmarks. However, the framework has not been compared with any existing baseline models, which makes it difficult to analyse the effectiveness of the model and its utility. I would strongly recommend that the authors compare their model with state-of-the-art models in concept similarity. Otherwise, the impact of the obtained results cannot be understood.
3. The framework comprises node embedding models such as DeepWalk as well as KG embedding models such as TransE. It is not mentioned how the node embedding models are trained on a KG. Is the relation information ‘r’ in a triple ignored? If yes, why? It would be nice to have an explanation of how these models are trained. Furthermore, statistics of the KG used for training are missing: what is the size of the KG on which the node embedding and KG embedding models are trained, in terms of the number of triples, entities, relations, etc.? Readability would also improve if a few real examples of the triples in the KGs were provided alongside the explanation.
4. The abstract, labels, and label + description have been used in the framework. I assume that “abstract” and “description” are the same thing, but the two terms create confusion. If they are the same, please use uniform terminology; otherwise, I would recommend pointing out the difference in the paper.
5. There are no details about DBpedia-RG 65 and DBpedia-MC 30. In what ways are they different from WD-RG65 and WD-MC30?
6. Where are the abstracts/descriptions for the Wikidata dataset in Table 3 extracted from? Are they extracted from Wikipedia? If yes, why are the results not the same as for DBpedia in Table 2, given that it contains the same abstracts from Wikipedia?
7. I would recommend that the authors list the major contributions at the end of the introduction section to improve readability, and provide the related work at the beginning, so that the research gaps are understood before delving into the contributions made by this paper.
8. Minor: The link to Google Drive is available only after making a request. It would be nice to have it freely accessible without an access request.