Farspredict: A Benchmark Dataset for Link Prediction

Tracking #: 3517-4731

Najmeh Torabian
Behrouz Minaei-Bidgoli
Mohsen Jahanshahi

Responsible editor: 
Mehwish Alam

Submission type: 
Full Paper

Abstract:
Link prediction using knowledge graph embedding (KGE) is a popular method for completing knowledge graphs. Moreover, training KGEs on non-English knowledge graphs can enhance knowledge extraction and reasoning within the context of these languages. However, several challenges in non-English KGEs hinder the learning of low-dimensional representations for a knowledge graph's entities and relations. This paper proposes "Farspredict," a Persian knowledge graph based on Farsbase, the most comprehensive Persian knowledge graph. It also explains how knowledge graph structure affects link prediction accuracy in KGE. To evaluate Farspredict, we implemented popular KGE models on it and compared the results with those of Freebase. After analyzing the results, we carried out some optimizations on the knowledge graph to improve its functionality in KGE, resulting in a new Persian knowledge graph. The implementation results of KGE models on Farspredict outperformed Freebase in many cases. Lastly, we discuss possible improvements to enhance the quality of Farspredict and the extent of those improvements.

Solicited Reviews:
Review #1
Anonymous submitted on 18/Sep/2023
Review Comment:

The paper proposes a new benchmark dataset, Farspredict, for link prediction. The new dataset is in the Persian language and was created from the KG Farsbase. Farsbase was created by extracting and integrating knowledge from a variety of sources, e.g., the Persian Wikipedia.
In the initial experiments, standard KG embedding methods show very low performance on Farsbase. Therefore, the authors present some cleaning mechanisms to obtain a denser KG that is better suited for link prediction.

My main critical point about this work is its motivation: why do we need a language-specific link prediction dataset if link prediction algorithms are language-independent? None of the methods evaluated in this work considers the labels of entities or relationships, nor do they use literal values. They work only on the graph structure. Hence, the language of the input graph is completely irrelevant to their performance.
This could be more interesting if you also considered link prediction methods that can make use of this information, e.g., [1].
Furthermore, your dataset seems to have some characteristic that makes link prediction algorithms perform poorly on it. It would be interesting to compare your dataset to other datasets using additional metrics to see why the performance on your dataset is so low. Maybe have a look at the datasets implemented in PyKEEN [2].
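
The dataset comparison suggested above could start from simple structural statistics over the raw triples. A minimal sketch, assuming triples are given as (head, relation, tail) tuples; the toy data, thresholds, and field names are illustrative, not taken from the paper:

```python
from collections import Counter

def kg_stats(triples):
    """Basic structural statistics for a KG given as (head, relation, tail) triples."""
    heads = [h for h, _, _ in triples]
    tails = [t for _, _, t in triples]
    entities = set(heads) | set(tails)
    rel_freq = Counter(r for _, r, _ in triples)
    return {
        "triples": len(triples),
        "entities": len(entities),
        "relations": len(rel_freq),
        # average (in + out) degree; dense benchmarks like FB15k score far higher here
        "avg_degree": 2 * len(triples) / len(entities),
        # relations appearing fewer than 5 times, a symptom of an open, text-extracted KG
        "rare_relations": sum(1 for c in rel_freq.values() if c < 5),
    }

# Toy triples standing in for a real dataset.
toy = [("a", "born_in", "b"), ("c", "born_in", "b"), ("a", "works_at", "d")]
print(kg_stats(toy))
```

Computing these numbers side by side for Farspredict and for established benchmarks would make the performance gap easier to diagnose.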

I think this work could be improved a lot by a more compelling motivation for the need for a Persian-language link prediction dataset. Furthermore, this would require a more extensive comparison to other existing datasets. Since the performance of existing link prediction models on Farspredict is so low, I would also consider filtering the dataset further.
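
The further filtering suggested above could follow the classic recipe used to distill benchmarks such as FB15k from Freebase: iteratively drop triples whose entities or relations fall below a frequency threshold. A rough sketch; the threshold values are assumptions for illustration, not taken from the paper:

```python
from collections import Counter

def filter_kg(triples, min_entity_freq=10, min_relation_freq=50):
    """Iteratively remove triples involving rare entities or relations until a fixed point.

    Thresholds are illustrative; real benchmark construction tunes them to the KG.
    """
    changed = True
    while changed:
        ent_freq = Counter(h for h, _, _ in triples) + Counter(t for _, _, t in triples)
        rel_freq = Counter(r for _, r, _ in triples)
        kept = [
            (h, r, t)
            for h, r, t in triples
            if ent_freq[h] >= min_entity_freq
            and ent_freq[t] >= min_entity_freq
            and rel_freq[r] >= min_relation_freq
        ]
        changed = len(kept) < len(triples)  # dropping triples may make further entities rare
        triples = kept
    return triples
```

Re-running the KGE experiments at several threshold settings would show whether sparsity alone explains the low scores.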

- Most parts are easy to understand.
- A large number of KG embeddings have been evaluated.
- The link prediction quality on Farspredict is very low.
- All link prediction algorithms presented in this work are language-independent.
- Also, other KGs, e.g., Wikidata and DBpedia, have labels in other languages. So, what exactly is the point of introducing this dataset instead of extending the Persian DBpedia/Wikidata version?
- A comparison to other existing link prediction datasets is missing.

Detailed Comments:
- Page 1, Line 40: What does “too weak for link prediction” mean? Try to be a bit more concise.
- Page 2, Line 7: I would expect a reference here, like for the other methods mentioned before.
- Page 2, Line 37: All big KGs have an international version. Also, I am pretty sure that Google’s KG is not purely in English, since it also serves all language-specific Google services.
- Page 3, Line 22: You mention Sakor et al. twice.
- Page 5, Line 23: How is Farsbase structured? It is unclear what exactly you are changing. What exactly do you mean by “not properties or anything else”?
- Page 4: I do not understand why Farsbase has 7378 relations, with many of them being very rare. Is Farsbase an open KG that has been extracted from text?
- Page 5, Line 31: What is “unsupervised text”? This term is unknown to me. Rephrase or add a small explanation.
- Page 5, Line 36 and following: The paragraph about the initial experiments as a motivation for the next steps is at an unexpected position in the section, and it took me a while to understand why you suddenly jump to link prediction experiments while still writing about the shortcomings of Farsbase. Consider restructuring or more explicitly mentioning that you are performing initial experiments.
- Page 6, Line 14: What does “valid Farsbase for link prediction” mean? Try to add an explanation.
- Page 6, Line 24: How do you remove non-Persian entities and relations? And what exactly are those? Entities and relations that do not have a Persian-language label?
- Page 7, Table 2: It would be more helpful if you could compare this to existing link prediction datasets.
- Page 8, Table 3: The results are extremely bad. It would be very helpful to also add model results on other datasets to get a direct comparison.
- Page 9, Line 25: I do not understand what “graph connectivity was effective” means.
- Page 9, Line 31: “Hypothesis 1 suggests that the presence of entities that are rarely used in Wikipedia reduces the likelihood of selecting a new valid triple”. I do not understand this sentence, and I am also not sure why the presence of an entity in Wikipedia is relevant to the performance of the models in link prediction.

[1] Daza, Daniel, Michael Cochez, and Paul Groth. "Inductive entity representations from text via link prediction." Proceedings of the Web Conference 2021. 2021.
[2] https://github.com/pykeen/pykeen

Review #2
Anonymous submitted on 06/Oct/2023
Review Comment:

The paper titled ‘Farspredict: A Benchmark Dataset for Link Prediction’ attempts to provide a dataset for the link prediction task using knowledge graph embedding models. Some of the authors were involved in the creation of a Persian KG in 2019, and in this work they use nine knowledge graph embedding models on top of a version of this knowledge graph that, after all the filtering steps, is more of a data graph. The original graph is of course disconnected and cannot even be called a data graph, so, in order to prepare it for KGE models, a filtering/removing/connecting process, dubbed Farspredict, is described. For this, the authors present ‘Algorithm 1’, which is basically a manual description of the discrete steps one would follow (I disagree with calling this an algorithm). Finally, the performance results of running the KGE models are reported. The results are pretty low, and there is no information on the hyperparameter search of the evaluation. The motivation of the work is also not clear: which part of the data was incomplete, what the authors tried to complete using KGEs, and what the outcome is all remain unclear.

Questions to the authors:
I would like to know why the authors state ‘training KGEs on non-English knowledge graphs can enhance knowledge extraction’ and never in the entire work show how the use of KGEs improved knowledge extraction. Extraction is not the same as link prediction. In addition, why do the authors state ‘Knowledge graphs are valuable resources that provide the possibility of extracting knowledge from textual sources’!
‘Overall, compared to standard dataset results, the results presented in Table 3 show that the mean rank of Farspredict link prediction is higher than standard datasets.’ -> None of the benchmark datasets are used in this work; what do the authors mean by this sentence?
Why was such an inconsistent way of representing results chosen in Table 3?

The paper requires a substantial language check; some parts include words that only a language model could use, while other parts are very weak. There is much inappropriate phrasing (only a few examples below):

‘we implemented popular KGE models on it’ -> you do not implement the models on a benchmark, but evaluate them on it.
‘Several partly human efforts’ -> do you mean partially?
The use of articles (‘the’, ‘a’) is problematic throughout the paper: ‘A knowledge graph completion is a valuable technique’ -> knowledge graph completion takes no article here; also, it is not a technique, it is a downstream AI task.

Review #3
By Afshin Sadeghi submitted on 23/Jan/2024
Review Comment:

This manuscript was submitted as 'full paper' and should be reviewed along the usual dimensions for research contributions, which include

(1) originality,

- The dataset is an extension of another dataset and brings little novelty.
- The processing of the KG is just pre-processing of a dataset and does not make a major contribution.

(2) significance of the results,

- The tested methods are outdated. Newer methods like CompGCN and GFA-NN predict much better on sparse KGs, and sparsity is not a big challenge for them.

- The evaluation does not show how good the provided method is; it merely tests KGE methods.

and (3) quality of writing.

- The related work presented covers famous KGs but not famous benchmarking KGs.
- The given description of KGE method types is redundant.
- A comparison to KGE benchmarking datasets is missing.
- The contribution and goal of the paper are not clear, and the paper reads more like a KGE paper than a KG benchmarking dataset paper.

Please also assess the data file provided by the authors under “Long-term stable URL for resources”. In particular, assess (A) whether the data file is well organized and in particular contains a README file which makes it easy for you to assess the data, (B) whether the provided resources appear to be complete for replication of experiments, and if not, why, (C) whether the chosen repository, if it is not GitHub, Figshare or Zenodo, is appropriate for long-term repository discoverability, and (D) whether the provided data artifacts are complete. Please refer to the reviewer instructions and the FAQ for further information.

- I did not find a link in the paper that provides access to the dataset. For a dataset paper, this is a major reason to reject the paper.