A Study of Concept Similarity in Wikidata

Tracking #: 3520-4734

Filip Ilievski
Kartik Shenoy
Hans Chalupsky
Nicholas Klein
Pedro Szekely

Responsible editor: 
Harald Sack

Submission type: 
Full Paper

Abstract:
Robust estimation of concept similarity is crucial for applications of AI in the commercial, biomedical, and publishing domains, among others. While the related task of word similarity has been extensively studied, resulting in a wide range of methods, estimating concept similarity between nodes in Wikidata has not been considered so far. In light of the adoption of Wikidata for increasingly complex tasks that rely on similarity, and its unique size, breadth, and crowdsourced nature, we propose that conceptual similarity should be revisited for the case of Wikidata. In this paper, we study a wide range of representative similarity methods for Wikidata, organized into three categories, and leverage background information for knowledge injection via retrofitting. We measure the impact of retrofitting with different weighted subsets from Wikidata and ProBase. Experiments on three benchmarks show that the best performance is achieved by pairing language models with rich information, whereas knowledge injection is most beneficial for methods that do not already incorporate comprehensive information. The performance of retrofitting depends on the selection of high-quality similarity knowledge. A key limitation of this study, shared with prior work, lies in the limited size and scope of the similarity benchmarks. While Wikidata provides an unprecedented possibility for a representative evaluation of concept similarity, effectively doing so remains a key challenge.
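For readers unfamiliar with retrofitting, the sketch below illustrates the Faruqui-style iterative update that the abstract alludes to: each vector is repeatedly pulled toward its graph neighbors while staying anchored to its original distributional estimate. The function names, toy vocabulary, and uniform `alpha`/`beta` weights are illustrative assumptions; the paper's actual setup uses weighted subsets of Wikidata and ProBase as the neighbor graph.

```python
import numpy as np

def retrofit(embeddings, neighbors, alpha=1.0, beta=1.0, iterations=10):
    """Minimal Faruqui-style retrofitting sketch (illustrative, not the
    paper's exact weighting scheme).

    embeddings: dict mapping concept -> np.ndarray vector
    neighbors:  dict mapping concept -> list of related concepts
    alpha:      weight anchoring a vector to its original estimate
    beta:       weight pulling a vector toward each neighbor
    """
    original = {c: v.copy() for c, v in embeddings.items()}
    new = {c: v.copy() for c, v in embeddings.items()}
    for _ in range(iterations):
        for c, nbrs in neighbors.items():
            nbrs = [n for n in nbrs if n in new]
            if not nbrs:
                continue
            # Weighted average of the original vector and current
            # neighbor vectors; concepts without neighbors are untouched.
            numerator = alpha * original[c] + beta * sum(new[n] for n in nbrs)
            new[c] = numerator / (alpha + beta * len(nbrs))
    return new
```

Under this update, concepts linked in the background graph (e.g. two Wikidata nodes related by a similarity edge) drift closer in the embedding space, while unlinked concepts keep their original vectors.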
Solicited Reviews:
Review #1
Anonymous submitted on 25/Aug/2023
Review Comment:

In this version, the authors have addressed the concerns and remarks I mentioned previously. I think the paper is now improved.

- The reason for selecting the approaches TransE, ComplEx, and DeepWalk is now justified in the revised paper.

- My previous request to provide the statistics of the datasets used to train the KG embedding models and the node embedding methods has been addressed by providing Table 2. However, it would be good to also provide the train/test splits.

Review #2
Anonymous submitted on 04/Sep/2023
Review Comment:

The paper reads very well now, all reviewer comments have been properly and diligently addressed. Great work!

Review #3
Anonymous submitted on 05/Oct/2023
Review Comment:

I would like to thank the authors for taking into account the issues brought up in my prior review and including the pertinent information.
They have thoughtfully addressed the concerns raised about how generalisable the proposed approach is and how the outcomes of this study, namely the related concepts identified in Wikidata, might be utilised in future research. Therefore, I would like to accept the paper.