Sem@K: Is my knowledge graph embedding model semantic-aware?

Tracking #: 3345-4559

Nicolas Hubert
Armelle Brun
Pierre Monnin
Davy Monticolo

Responsible editor: Claudia d'Amato

Submission type: Full Paper

Abstract:
Using knowledge graph embedding models (KGEMs) is a popular approach for predicting links in knowledge graphs (KGs). Traditionally, the performance of KGEMs for link prediction is assessed using rank-based metrics, which evaluate their ability to give high scores to ground-truth entities. However, the literature claims that the KGEM evaluation procedure would benefit from adding supplementary dimensions to assess. That is why, in this paper, we extend our previously introduced metric Sem@K that measures the capability of models to predict valid entities w.r.t. domain and range constraints. In particular, we consider a broad range of KGs and take their respective characteristics into account to propose different versions of Sem@K. We also perform an extensive study of KGEM semantic awareness. Our experiments show that Sem@K provides a new perspective on KGEM quality. Its joint analysis with rank-based metrics offer different conclusions on the predictive power of models. Regarding Sem@K, some KGEMs are inherently better than others, but this semantic superiority is not indicative of their performance w.r.t. rank-based metrics. In this work, we generalize conclusions about the relative performance of KGEMs w.r.t. rank-based and semantic-oriented metrics at the level of families of models. The joint analysis of the aforementioned metrics gives more insight into the peculiarities of each model. This work paves the way for a more comprehensive evaluation of KGEM adequacy for specific downstream tasks.

Minor Revision

Solicited Reviews:
Review #1
By Erik B. Myklebust submitted on 21/Apr/2023
Minor Revision
Review Comment:

This paper builds on previous work to create a metric to assess knowledge graph embedding and link prediction models' ability to predict semantically consistent triples.

An extensive evaluation of popular knowledge graph embedding models over several datasets is provided.

The work is of high significance as there is a need for ways to evaluate KGEMs under semantic constraints. The design of metrics like Sem@k enables the construction of models that are optimized to preserve semantic relations in knowledge graphs.

The paper is clearly written and well organized. I appreciate that the research questions are presented and that the steps to evaluate them are provided.

Specific comments:
1. Think about changing the title. Questions don't make great titles as usually the answer would be "no".
2. RQ1 can be formulated better. "Agnostic to semantics" makes more sense; I am not sure what "aware" means here.
3. Section 3.2 repeats the RQs. That's not necessary. Leave them to the introduction.
4. Section 4 mentions the delta to [14,15]. This would be better placed in the introduction.
5. Section 4.1 is blank. Just make 4.1.1 -> 4.1.
6. Section 5.2: You mention HAKE (earlier), but it's not used as a baseline. Wouldn't a model constructed with the purpose of preserving hierarchies perform well on the Sem@k metric? Please justify why it's not included.
7. Section 5.3: You only use 1 corrupt triple per true triple. What is the reasoning here? ConvE uses |E|-1 in the original implementation, right? I would expect similar results, as the model would see the same number of corrupt triples over 4000 epochs with 1 corrupt triple as over 400 epochs with 10, but would it not impact the optimization?
8. Section 6.1, Fig4. RGCN -> R-GCN.
9. Section 7: You mention good Sem@k at epoch 1, why? Does the metric weight far ancestors too much, so that predicting top classes is rewarded?
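
To make the comparison in comment 7 concrete, here is a minimal sketch of the arithmetic behind it. It assumes that each epoch visits every true triple once and samples fresh corrupt triples for it; the function name and this sampling assumption are illustrative and are not taken from the paper.

```python
def corrupt_triples_seen(epochs: int, negatives_per_positive: int) -> int:
    """Corrupt triples generated for one true triple over a whole training run.

    Assumption (not from the paper): every epoch visits each true triple
    exactly once and samples `negatives_per_positive` corrupt triples for it.
    """
    return epochs * negatives_per_positive


# The two settings contrasted in comment 7.
one_negative = corrupt_triples_seen(epochs=4000, negatives_per_positive=1)
ten_negatives = corrupt_triples_seen(epochs=400, negatives_per_positive=10)

print(one_negative, ten_negatives)
```

Under this assumption both settings expose a true triple to the same total number of corrupt triples, yet the first setting visits it in 4000 passes and the second in 400, which is why the effect on the optimization remains an open question.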

The datasets are provided on GitHub; this is fine but not ideal. If the major contribution comes from a PhD student, the data (and code) should be moved to a less volatile location, preferably one tied to the organization rather than to an individual author.

I see links to KGE repositories, but is there code for reproducing the results from the paper?
Ideally, there would be a single script that runs all evaluations and produces the tables and figures from the paper.

Review #2
Anonymous submitted on 11/Jun/2023
Minor Revision
Review Comment:

The paper presents novel methods for evaluating knowledge graph embedding models with respect to their ability to predict semantically meaningful triples, i.e., triples which satisfy domain and range constraints that are either specified by the KG schema or obtained relying on the data in the KG itself. The authors propose several variations of the respective sem@K metrics extending their earlier work [*], and perform extensive evaluation of popular knowledge graph embeddings with respect to their "semantic-awareness" relying on the proposed metrics on a number of standard datasets with slight adaptations.

[*] N. Hubert, P. Monnin, A. Brun and D. Monticolo, Knowledge Graph Embeddings for Link Prediction: Beware of Semantics!, in: DL4KG@ISWC 2022: Workshop on Deep Learning for Knowledge Graphs, held as part of ISWC 2022: the 21st International Semantic Web Conference, Virtual, China, 2022.

The need for extending evaluation protocols of embedding models with metrics that estimate how well KG embeddings capture semantic information in the KG has been acknowledged in a number of works cited by the authors. The considered problem is definitely timely, relevant, and perfectly fits the scope of the Semantic Web journal. The introduced sem@K metrics are very natural suggestions for the considered task, capturing the simple idea of measuring how well KG embeddings preserve the domain and range restrictions. The main contribution of the work, in my opinion, is the extensive, systematic empirical evaluation of families of KG embedding models with respect to the introduced metrics. Generally, the technical part of the paper is well written, and the examples throughout the work help readers grasp the introduced concepts.
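
The idea of checking domain and range restrictions on top-ranked predictions can be illustrated with a minimal sketch. It assumes a simplified reading in which a candidate counts as valid when at least one of its types belongs to the expected domain or range of the relation; the function `sem_at_k`, the toy entities, and the types are hypothetical and do not reproduce the paper's exact definitions.

```python
from typing import Dict, List, Set


def sem_at_k(ranked_entities: List[str],
             entity_types: Dict[str, Set[str]],
             expected_types: Set[str],
             k: int) -> float:
    """Fraction of the top-k candidates that have at least one expected type.

    Simplifying assumption: entities without type information are counted
    as invalid here.
    """
    top_k = ranked_entities[:k]
    if not top_k:
        return 0.0
    valid = sum(1 for e in top_k if entity_types.get(e, set()) & expected_types)
    return valid / len(top_k)


# Hypothetical toy data: ranked tail candidates for (Joe, livesIn, ?),
# where the range of livesIn is assumed to be City.
entity_types = {
    "Chicago": {"City"},
    "Washington": {"City"},
    "Joe": {"Person"},
    "Pizza": {"Food"},
}
ranked_tails = ["Chicago", "Joe", "Washington", "Pizza"]

score = sem_at_k(ranked_tails, entity_types, {"City"}, k=3)
print(score)
```

In this toy ranking, two of the top three candidates are cities, so the score is 2/3 regardless of whether the ground-truth entity is among them. This is what distinguishes such a metric from rank-based ones.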

There are several questions/suggestions for improvement from my side:

- The authors discuss related works that also measure the semantic-awareness of KG embedding models, but do not directly compare the respective metrics to the introduced ones. It seems that the inc@K metric from [5] reflects the same intuition as sem@k[base] when the ontology only contains domain and range restrictions as well as class disjointness axioms?

- As the main contribution of the paper seems to be the extensive evaluation of the models with respect to their semantic-awareness, the evaluation section might need to be improved a bit to help the readers grasp the main message of the paper. While the authors summarize some of the observations in the text, it is often difficult to extract the messages from the numerous tables. Bar charts instead of (or in addition to) the tables presenting rank-based and semantic-based metrics could be helpful. To avoid overloading the plots and to keep the results digestible, it might be sufficient to report hits@k (resp. sem@k) for a single k only.

- The provided GitHub link contains the datasets used in the experiments; these seem to be complete. It would be helpful to also share the implementation of the introduced evaluation protocols along with the README file in order to ensure the reproducibility of the results.

- While the schema of the KG and the type hierarchy are definitely the most immediate choices for semantic artifacts that can be considered in the evaluation of the semantic-awareness of KG embeddings, in the general case KGs can be accompanied by more expressive ontologies. It might be worthwhile to include a discussion on the possible extension of the proposed metrics to also account for such ontologies. For example, an ontology axiom might state that "presidents live in capitals", in which case, given that Joe is known to be a president in the KG, the prediction "Joe lives in Chicago" would not reflect the respective axiom, while still being semantically correct with respect to the domain and range restrictions of the "livesIn" relation. Another aspect concerns evaluating whether KG embedding models predict combinations of facts that do not contradict each other. Each fact might be perfectly valid based on the ontology when considered on its own, but the combination of predictions could violate the schema/ontology. Extending the above example, the model might make two predictions, "Joe lives in Chicago" and "Joe has profession president". Each prediction considered in isolation is meaningful and semantically correct, but jointly they do not follow the above axiom.

- In the current version of the paper, the authors restrict themselves to entities for which types are specified in the KGs. This is generally a bit of a limiting factor (which the authors also admit). In principle, KG embedding models are also capable of predicting types themselves. Thus, generalizing the proposed metrics to account for combinations of predictions seems to be a rather intuitive and natural extension.

- While it might be too demanding to ask for the inclusion of the extensions of the proposed metrics suggested above in the main part of the paper and experiments, I think having a broader view on the concept of semantic awareness touching upon the respective directions of considering mutual predictions made by the model and including more expressive ontologies could be helpful. This can be done in a separate Discussion section, for example.
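
The "presidents live in capitals" example above can be sketched as a joint check over a set of predicted facts. This is only an illustration of the suggested direction, not a proposal for how the metrics should be implemented: the function name, the relation names `livesIn` and `hasProfession`, and the set of capitals are hypothetical.

```python
from typing import Iterable, List, Set, Tuple

Triple = Tuple[str, str, str]  # (head, relation, tail)


def president_axiom_violations(facts: Iterable[Triple],
                               capitals: Set[str]) -> List[Triple]:
    """Return livesIn facts that break the axiom "presidents live in capitals".

    The check is joint: a livesIn fact is only flagged when the same set of
    facts also states that its head has the profession "president".
    """
    facts = set(facts)
    presidents = {h for (h, r, t) in facts
                  if r == "hasProfession" and t == "president"}
    return sorted((h, r, t) for (h, r, t) in facts
                  if r == "livesIn" and h in presidents and t not in capitals)


capitals = {"Washington"}  # hypothetical background knowledge
lives = ("Joe", "livesIn", "Chicago")
job = ("Joe", "hasProfession", "president")

alone_lives = president_axiom_violations([lives], capitals)
alone_job = president_axiom_violations([job], capitals)
jointly = president_axiom_violations([lives, job], capitals)
print(alone_lives, alone_job, jointly)
```

Each prediction passes the check in isolation, while the pair is flagged. A metric restricted to the domain and range of individual triples cannot capture this situation.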

Additionally, further careful proofreading should be done, as there are quite a few typos/grammatical inaccuracies left in the paper:

- Abstract: "Its joint analysis with rank-based metrics offer" -> "...offers"
- p. 2: Fig. 2 is referred to as a motivating example, but it appears much later in the text (p. 8). This is rather unusual; I think it would be more intuitive to place the motivating example at the beginning of the paper, where it is first referenced.
- p. 6: "...values increases..." -> "...values increase..."
- p. 7: "... it is assumed the test set only comprises..." -> "... it is assumed that the test set only comprises..."
- p. 7: "...Model B semantic awareness" -> "...semantic awareness of the Model B..."
- p. 7: "As aforementioned..." -> "As mentioned above..."
- p. 10: " the number of edges linking c to c'..." -> " the length of the path from c to c'"?
- p. 11: "Accordingly to Section 4.2.1..." -> either "According to Section 4.2.1" or "As discussed in Section 4.2.1..."
- p. 13: "...the semantic awareness of the most popular KGEMs are analyzed." -> "... is analyzed"
- p. 13: "...with d a distance function..." -> "...where d is a distance function..."
- p. 16: "...and are provided..." -> "...are provided..."
- p. 17: "...are better able at recovering..." -> "...are better capable of recovering..."
- p. 18: In Fig. 5 (c), ComplEx seems to be missing. Is there a particular reason for that?
- p. 18: "Where translational and semantic matching models treat..." -> "While translational and semantic matching models treat..."
- p. 18: "...a trade-off exist" -> "...exists"
- p. 19: "...the most of KGEMs reaches..." -> "...the most of KGEMs reach..."
- p. 20: "...with a hierarchy class..." -> "...with a class hierarchy..."
- p. 20: "... are better able at recovering" -> "...are better capable of recovering"
- p. 21: "... the performance of KGEMs in terms of rank-based metrics are not..." -> "... is not"
- p. 21: " for a future work..." -> " for future work..."