Weight-aware Tasks for Evaluating Knowledge Graph Embeddings

Tracking #: 3522-4736

Authors: 
Weikun Kong
Xin Liu
Teeradaj Racharak
Guanqun Sun
Qiang Ma
Le-Minh Nguyen

Responsible editor: 
Claudia d'Amato

Submission type: 
Full Paper
Abstract: 
Knowledge graph embeddings, which represent entities and relations as vectors or matrices, are widely used together with deep learning to solve various problems, such as natural language understanding and named entity recognition. The quality of knowledge graph embeddings strongly affects the performance of models on many knowledge-involved tasks. Link prediction (LP) and triple classification (TC) are widely adopted to evaluate the performance of knowledge graph embeddings. Link prediction predicts the missing entity that completes a triple, which represents a fact in a knowledge graph, while triple classification determines whether an unknown triple is true or not. Both link prediction and triple classification can intuitively reflect the performance of a knowledge graph embedding model; however, they treat every triple equally and therefore cannot evaluate the performance of embedding models on knowledge graphs that provide weight information on their triples. As a consequence, this paper introduces two weight-aware extended tasks for LP and TC, called weight-aware link prediction (WaLP) and weight-aware triple classification (WaTC), respectively, aiming to better evaluate the performance of embedding models on weighted knowledge graphs. WaLP and WaTC emphasize the ability of the embeddings to predict and classify triples with high weights, respectively. Lastly, we respond to the newly introduced tasks by proposing a general method, WaExt, that extends existing knowledge graph embedding models to weight-aware extensions. We test WaExt on four knowledge graph embedding models, achieving competitive performance against the baselines. The code is available at: https://github.com/Diison/WaExt.
Tags: 
Reviewed

Decision/Status: 
Reject (Two Strikes)

Solicited Reviews:
Review #1
By Erik B. Myklebust submitted on 27/Sep/2023
Suggestion:
Minor Revision
Review Comment:

This manuscript was submitted as 'full paper' and should be reviewed along the usual dimensions for research contributions which include (1) originality, (2) significance of the results, and (3) quality of writing. Please also assess the data file provided by the authors under “Long-term stable URL for resources”. In particular, assess (A) whether the data file is well organized and in particular contains a README file which makes it easy for you to assess the data, (B) whether the provided resources appear to be complete for replication of experiments, and if not, why, (C) whether the chosen repository, if it is not GitHub, Figshare or Zenodo, is appropriate for long-term repository discoverability, and (D) whether the provided data artifacts are complete. Please refer to the reviewer instructions and the FAQ for further information.

This paper introduces methods for evaluating weighted KGE models on two tasks, link prediction and triple classification. Furthermore, they introduce a weighted extension to KGE model scoring functions to take triple weights into account. The novelty is alright, but is limited to the application of weighting the scoring functions in KGE models. Scoring function/loss function weighting is very common elsewhere in ML research/applications.
How would you apply this to KGs where some triples have weights and others don’t? (You don’t need to include this in the paper, just a thought to follow up on in the future.)

The results show improvement with the new method. Are the results a single model run or an average of many? Repeating with new embedding initializations would help in evaluating significance of the approach.

I would like to see a discussion on how this would impact certain link predictions, preferably with examples. This would provide the reader with context.

Overall, the paper is well written and fully understandable.

Minor:
- 4-38 punctuation.
- 4-40 no transition.
- Section 2.2 doesn’t flow well and you need to link all references to your work.
- 7-21 punctuation
- 8-17 proper subscript please, P and R are fine to use for precision and recall.

Review #2
Anonymous submitted on 08/Oct/2023
Suggestion:
Major Revision
Review Comment:

In this article, the authors introduce the tasks of weight-aware link prediction and triple classification for knowledge graph completion. They consider triples associated with a weight ((s,p,o), w) and adapt the tasks of Link Prediction (LP) and Triple Classification (TC) to consider these weights in the evaluation, by introducing weight-aware metrics: WaMR, WaMRR, WaHits@N, and WaF1. They also introduce a framework extending knowledge graph embedding to consider weights during the training of models. It is noteworthy that this framework seems general (see below) and could potentially be applied to several models, which the authors show in their experimental evaluation that considers TransE, TransH, DistMult, and ComplEx.

Overall, I appreciate the idea of weight-aware LP and TC as well as extending metrics and training frameworks in this objective. However, I think the paper in its current form misses fundamental explanations for the reader to clearly understand the implication of the proposal. I also have several theoretical doubts about the proposal. That is why I recommend a major revision.

# Major comments

- Section 2.2 focuses on FocusE, which is of importance for the paper as it is a direct competitor. However, the section does not provide enough details for the reader to have a clear intuition of the model. For example, the definition of negative triples l^- is not given. I also think the equations should be further described, detailed, and exemplified. Furthermore, the concluding remarks on each of papers 33, 34, 35, and 36 are rather weak. I think the authors could provide here a clear positioning w.r.t. their proposal instead of a rather general statement.

- In Section 3, the authors mention "weight-aware LP an weight-aware TC have been introduced". Do you mean they have been introduced by other authors before? Or do you mean that you introduce these tasks in this paper?

- Section 3.1.2: the role of the activation function g is not sufficiently described here to understand its impact on the task at hand. Some explanations are provided in the experimental section, but this is too late to clearly understand the contribution in 3.1.2 (e.g., Fig 5 could be here). Furthermore, I don't understand how the definition of r^w_i allows the models to focus on high-weight triples. Indeed, with a linear g:
* A triple ranked 1 with a weight of 100 will be reranked 0.01. If it is ranked 100, it will be reranked 1
* A triple ranked 1 with a weight of 2 will be reranked 0.5. If it is ranked 100, it will be reranked 50
The error on the triple with weight 100 is actually smoothed by the re-ranking. As such, I am missing the intuition of why this re-ranking makes models focus on high-weight triples (see the small sketch below).
The constant u is not defined in later formulas. Additionally, the need for a normalization factor is not motivated. I also wonder whether it still makes sense to compute WaHits@N since ranks can now be real-valued numbers. That is why I think additional intuitions and examples should be given to describe the behavior of WaMR, WaMRR, and WaHits@N.
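
To make the arithmetic above concrete, here is a minimal sketch in Python; it assumes my reading of Section 3.1.2, namely that the weighted rank is r^w_i = r_i / g(w_i) with a linear g:

    def weighted_rank(rank, weight, g=lambda w: w):
        # assumed re-ranking: raw rank divided by g(weight), with g linear by default
        return rank / g(weight)

    print(weighted_rank(1, 100), weighted_rank(100, 100))  # 0.01 1.0
    print(weighted_rank(1, 2), weighted_rank(100, 2))      # 0.5 50.0

In absolute terms, dropping from rank 1 to rank 100 costs only 0.99 in weighted rank for the weight-100 triple, but 49.5 for the weight-2 triple, which is exactly the smoothing effect I question above.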

- Section 3.2.2: I have the same remarks about the lack of intuitions and examples as for Section 3.1.2. Additionally, why is the normalization factor needed, and why is it used as a denominator here, contrary to 3.1.2 where it is used as a numerator?

- Section 3.3: the authors set the weight w' of negative triples as a hyper-parameter. I do not dispute the difficulty of choosing a weight for non-existent triples. However, I think this could be further discussed and explained.
Additionally, your work relies on the pairwise hinge loss. Could other losses be considered? Why choose this one? How can this framework be applied to KGE models that do not rely on the PH loss? I also think the intuitions behind this extension of the PH loss could be better exemplified here.
I also have a theoretical doubt about this loss. Indeed, DistMult and ComplEx try to maximize the score of positive triples w.r.t. negative ones (if I am correct). Minimizing your loss comes down to minimizing the score of positive triples and maximizing the score of negative triples. I am thus curious about this point: whether it respects the original behavior of these models, whether it can be applied to them, and thus whether it can be compared with them in the experimental section.
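
For reference, the standard pairwise hinge (margin) loss takes a different form depending on whether the scoring function is a distance (lower is better, as in TransE) or a plausibility score (higher is better, as in DistMult and ComplEx). Sketching both conventions, with margin \gamma, distance f, plausibility score s, positive triple l, and negative triple l^-:

    L = \sum_{(l, l^-)} \max(0, \gamma + f(l) - f(l^-))   (distance-based)
    L = \sum_{(l, l^-)} \max(0, \gamma - s(l) + s(l^-))   (score-based)

If the weight-aware extension is formulated only in the first convention, applying it verbatim to DistMult and ComplEx would push the scores of positive triples down, which is exactly the concern above.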

# Minor comments

- Several times (in the abstract and in the text), you mention that "Link Prediction and Triple Classification are widely adopted to evaluate the performance of knowledge graph embeddings". I would argue that some models are specifically designed for LP and TC while some others are not (e.g., RDF2Vec). Evaluating the performance of KGE through LP and TC is only one possible way to go and does not represent a holistic evaluation of KGE, especially since other metrics are being introduced, e.g., Sem@K [1], CO2/energy consumption [2], explainability [3], etc.

- "Indeed, the traditional evaluation metrics, such as Link Prediction and Triple Classification": LP and TC are not metrics. Did you mean "tasks"?

- The subsection "1.0.1 Twofold contributions" could be removed, and its paragraphs simply added to Section 1 - Introduction

- Incomplete sentence "The hyper-parameter \beta \in [0,1]."

- The title of Section 2.3.3, "Tail Entity Prediction", is strange as it seems to present the task of link prediction in an uncertain KG.

- p6, l27, a comma is starting the line.

- Figure 3 is not commented on, and I am therefore not sure of its usefulness.

- Figure 5 is interesting but should come earlier in the paper. The various epochs considered for the dynamic base should appear on the figure to visually understand their impact.

- Section 4.3: what is "WeExt"?

- Tables 2, 3, and 4: Wa[MR, MRR, Hits@N] are not used but their traditional counterparts are. Why?

- I do not understand Figure 6. I think axes should be labeled.

# References

[1] Nicolas Hubert, Pierre Monnin, Armelle Brun, Davy Monticolo. Sem@K: Is my knowledge graph embedding model semantic-aware?
[2] Xutan Peng, Guanyi Chen, Chenghua Lin, and Mark Stevenson. Highly efficient knowledge graph embedding learning with Orthogonal Procrustes Analysis.
[3] Andrea Rossi, Donatella Firmani, Paolo Merialdo, and Tommaso Teofili. Explaining link prediction systems based on knowledge graph embeddings.

Review #3
Anonymous submitted on 29/Oct/2023
Suggestion:
Minor Revision
Review Comment:

I would like to thank the authors for providing the answers to the review comments and the updated version of the manuscript, including the extension of Section 2. The extended version of Section 2 includes short summaries of the methods described in the papers mentioned in my original review. The authors decided not to include an experimental comparison with any of these pre-existing methods, as some of the methods were designed for different tasks and the one considered applicable resulted in unsatisfactory performance.

I still think that pre-existing methods dealing with weighted graphs could be discussed in more detail: those 4 papers were merely examples that could be found quickly, not a comprehensive list (just as an example of another paper worth mentioning, see [1] below). It would also be good to have a more contextualised summary of existing work: for instance, mentioning explicitly why existing methods are not suitable or not comparable to the proposed approach (e.g., why ProbWalk would be unsuitable, rather than stating “This innovative approach provides valuable insights into the utilization of transition probabilities as a means to capture the underlying structure of the graph in the embedding process.”).

In view of this, I am still not fully sure about the extent of the added value of the proposed evaluation protocols and weight-aware extensions.

[1] Grassia, M. and Mangioni, G. wsGAT: Weighted and Signed Graph Attention Networks for Link Prediction, 2021: https://arxiv.org/abs/2109.11519

Review #4
By Heiko Paulheim submitted on 30/Oct/2023
Suggestion:
Major Revision
Review Comment:

The authors have taken some efforts in order to revise their paper, which are much appreciated.

I still have mixed feelings about the evaluation. For Tables 2-4, it is unclear to me what they really show, since the explanations are not sufficient.
* Table 2 is explained to show the results of the standard link prediction task, so I guess that weights are only used on the training set, while the test set is unweighted, and standard forms of MR etc. are used?
* Table 3 is explained to show the results of the weighted link prediction task. So, shouldn't the headers read "WaMR", "WaMRR", etc.? If yes, which weights are used for the standard versions of TransE etc.?
* In their response, the authors state that they "use a unified weighting function for evaluating the model" - but which one did they use?

The experiments are also underspecified - which base is used for g(x) in the dynamic weighting function? Which static base is used for the static weighting function? Are they always constant, or did the authors tune that parameter for the evaluation per dataset and embedding approach?

Some of the details in the metrics are still cumbersome. For the definition of r_i^w (bottom of page 6), it seems like a higher weight of a statement leads to a lower ranking. Why is this? A running example might help here.

Another problem I have is with the WaHits@N: the weighted rank is used here. If the ground truth statement at hand has a low weight, it is harder to end up among the top N than if it has a high weight. This seems a bit odd.
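
A small numeric illustration of this point, again assuming the weighted rank is r / g(w) with a linear g and that WaHits@N counts weighted ranks of at most N (my assumption):

    def wa_hit(rank, weight, n):
        # assumed criterion: the weighted rank rank / weight must be at most n
        return (rank / weight) <= n

    print(wa_hit(50, 10, 9))   # True: a high weight rescues a poor raw rank of 50
    print(wa_hit(5, 0.5, 9))   # False: a low weight pushes a good raw rank of 5 out of the top 9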

Smaller issue: the denominator "u" can be omitted in the normalization factor c when pulling it in front of the sums for WaMR etc. This simplifies the notation.

One thing I struggle with is the introduction of the activation function g. In my opinion, it is never clearly motivated why simply using the weights as they are (i.e., g(x)=x) should be inferior. This should be better motivated and also be included as a baseline. Moreover, the chosen activation functions lead to unintuitive final metrics, like F1 scores above 1, which can be very confusing. In evaluations, it would be better to have weighted evaluation metrics that are on the common ranges (i.e., between 0 and 1 for precision, recall, and F1).
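
To illustrate what I mean by common ranges, one hypothetical way to keep the weight-aware classification metrics within [0, 1] is to treat g(w) as a soft count and normalize by the total weight rather than by the number of triples (this is only a sketch, not the paper's definition):

    def wa_precision(weights, y_pred, y_true):
        # weighted true positives over weighted predicted positives; always in [0, 1]
        tp = sum(w for w, p, t in zip(weights, y_pred, y_true) if p and t)
        pp = sum(w for w, p in zip(weights, y_pred) if p)
        return tp / pp if pp else 0.0

    def wa_recall(weights, y_pred, y_true):
        # weighted true positives over weighted actual positives; always in [0, 1]
        tp = sum(w for w, p, t in zip(weights, y_pred, y_true) if p and t)
        ap = sum(w for w, t in zip(weights, y_true) if t)
        return tp / ap if ap else 0.0

The harmonic mean of these two quantities then yields a weight-aware F1 that also stays within [0, 1].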

Overall, in the evaluation, I miss some analysis beyond the pure presentation of numbers. For example, it is clearly visible that TransH benefits much more from the incorporation of weights than ComplEx. I miss some statements about what makes some approaches benefit more from using weights than others. The same holds for the observations with FocusE - it would be interesting to elaborate on why it does not work as well as WaExt.

This analysis would be particularly interesting w.r.t. different semantic meanings of weights in the datasets. For NELL, the scores are confidences, where a lower score indicates a higher likelihood of a statement being wrong. For PP15k, they are probabilities, which is semantically different. On the other hand, in ConceptNet, their semantics is more vague [1].

As far as the comparison to FocusE is concerned, the authors state that the numbers are not directly comparable, since they use another implementation. While I can accept that, the numbers for plain TransE and plain DistMult are *really* far from the numbers in the other table; they sometimes differ by an order of magnitude. It is hard to accept that such strong deviations should come only from different implementations.

What would be relevant as well is the distribution of weights across relations. If they are not uniform, they would guide the model to put more attention on a subset of the relations. Given that high-weight edges are also more frequent in the test set, the result could also be partly explained by the fact that the model puts more attention on relations which are relevant for the evaluation.

When thinking about the motivation of the work at hand, I wonder whether it would not make sense to evaluate the weight-aware triple classification part the other way around. In the current version, the authors give higher weight to high-weight (which, in most cases, means high-confidence) triples. Actually, an application of the approach would be more interesting if it had a high accuracy in scoring low-confidence triples (i.e., using the approach to discard those).

Regarding related work, there are also weighted versions of other embedding methods, such as RDF2vec [2].

Finally, I am happy that the authors try to address the weight issues, but I find it difficult that some of those fixes (particularly regarding the F1 score) are only promised for the final version, and to base my assessment on that promise.

Overall, the paper still leaves many question marks. The evaluation could be more in depth, also qualitatively. I suggest a thorough rework before resubmitting to SWJ or another journal.

[1] https://github.com/commonsense/conceptnet5/issues/152
[2] Michael Cochez, Petar Ristoski, Simone Paolo Ponzetto, Heiko Paulheim: Biased Graph Walks for RDF Graph Embeddings.