The RDF2vec Family of Knowledge Graph Embedding Methods

Tracking #: 3319-4533

Jan Portisch
Heiko Paulheim

Responsible editor: 
Claudia d'Amato

Submission type: 
Full Paper

Abstract: 
Knowledge graph embeddings are a group of machine learning techniques that project the entities and relations of a knowledge graph into continuous vector spaces. RDF2vec is a scalable embedding approach rooted in the combination of random walks with a language model. It has been successfully used in various applications. Recently, multiple variants of the RDF2vec approach have been proposed, introducing variations both on the walk generation and on the language modeling side. The combination of those different approaches has led to a growing family of RDF2vec variants. In this paper, we evaluate a total of twelve RDF2vec variants on a comprehensive set of benchmarks, and compare them to seven existing knowledge graph embedding methods from the family of link prediction approaches. Besides the established GEval benchmark, which comprises various downstream machine learning tasks on the DBpedia knowledge graph, we also use the new DLCC (Description Logic Class Constructors) benchmark, consisting of two gold standards, one based on DBpedia and one based on synthetically generated graphs. The latter allows for analyzing which ontological patterns in a knowledge graph can actually be learned by different embedding methods. With this evaluation, we observe that certain tailored RDF2vec variants can lead to improved performance on different downstream tasks, given the nature of the underlying problem, and that they, in particular, have a different behavior in modeling similarity and relatedness. Our experiments on a gold standard created from the real-world knowledge graph DBpedia reveal that all approaches perform surprisingly well due to correlating signals. On a synthetic dataset without such correlating signals, in contrast, we observe that there are quite a few classes which are hard to learn for all inspected knowledge graph embedding methods. For RDF2vec, we observe that the walk strategies influence the balance of similarity and relatedness.
Decision: 
Major Revision

Solicited Reviews:
Review #1
Anonymous submitted on 30/Jan/2023
Minor Revision
Review Comment:

The paper presents a detailed analysis of the learning capabilities of the family of RDF2vec embeddings.

The paper is an extension of three formerly published workshop and conference papers by the same authors. In particular, the new parts are the hypotheses in section 5 and their discussion in section 7.4 along with a comprehensive suite of experiments. While parts were published before, the paper as a whole is an interesting summary of a very long and prominent research endeavor, and thus I think it is suitable for publication in an archival journal.

The results are interesting both from an applicational and (semi-)theoretical perspective. The paper can be used by a practitioner to make an informed choice on which RDF2vec variant to choose. The DLCC and the benchmarks arising from it are a very valuable contribution enabling the comparison of different embeddings in a systematic manner on multiple semantic dimensions instead of on entangled, downstream datasets. Especially due to this contribution, I expect the paper to be highly cited.

I have no major complaints regarding the correctness, however, there are some aspects that require clarification or extending the proposed framework:
* It is unclear to me why DLs are brought into the picture at all. Are the proposed constructors exhaustive in any sense relevant to DLs, e.g., are they all possible expressions of depth 1 that could be constructed using ALC?
* This is very subjective, but I also think DLs are not a suitable formalism, since you are dealing with graphs, not ontologies. There is no guarantee that the graph even has a notion of the top concept. It seems to me that SPARQL BGPs would be a much better formalism: you would stay in the graph world, you would not introduce possibly foreign concepts, you would not need the somewhat odd-looking mix of DLs and low-level RDF in Eq. (13), and you would not need to explain how you use DL expressions to query a SPARQL endpoint in Section 6.1.
* I am not convinced "hypothesis" is a good term for what you have proposed in the paper. For example, compare Hypothesis 6a with the hypotheses posed in Section 3.1 of [1]. To me, it sounds quite similar to some of them. I would suggest either making them more formal or using a different, less loaded word.
* Section 6.2: Why those particular 6 classifiers? I suspect each is from a different family, but there are decision trees and random forests, and nevertheless, the argument should be given explicitly instead of being left for the reader to guess.
* Section 6.4: Algorithm 1 reads like LUBM [2]. An argument should be given for why we need yet another generator, especially since it seems to be very rigid, e.g., always generating a class hierarchy in the form of a balanced tree.
* Section 6.4: Why is resembling DBpedia important? Are the statistical properties of DBpedia representative of many KGs, or what?
* Section 7.2: You make conclusions about one variant being better than another on a suite of datasets. This seems to call for a statistical test, e.g., the Friedman test, followed by paired t-tests with the necessary corrections.
* Section 7.3: It is not clear to me why a one-sided binomial significance test is a good choice. At the very least I would expect you to report the null hypothesis and the alternative hypothesis.

[1] Abraham Bernstein and Natasha Noy, "Is This Really Science? The Semantic Webber’s Guide to Evaluating Research Contributions"
[2] Yuanbo Guo, Zhengxiang Pan, and Jeff Heflin, "LUBM: A Benchmark for OWL Knowledge Base Systems", Journal of Web Semantics, 2005
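The statistical-testing procedure suggested in the two points above (an omnibus Friedman test over datasets, followed by corrected pairwise tests, and a one-sided binomial test against random guessing) could be sketched as follows. This is a minimal illustration only: the variant names and all accuracy scores are hypothetical, and ties in ranks are not handled.

```python
from math import comb

def binom_pvalue_one_sided(k, n, p=0.5):
    """Exact one-sided binomial test: P(X >= k) under H0: X ~ Binomial(n, p).
    H0: the classifier's accuracy equals p (random guessing); H1: accuracy > p."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

def friedman_statistic(scores):
    """Friedman chi-square statistic for k methods evaluated on n datasets.
    `scores` maps method name -> list of per-dataset scores (higher = better).
    Assumes no ties within a dataset."""
    names = list(scores)
    k, n = len(names), len(next(iter(scores.values())))
    rank_sums = {name: 0.0 for name in names}
    for d in range(n):
        # Rank the methods on dataset d; rank 1 = best.
        ordered = sorted(names, key=lambda m: scores[m][d], reverse=True)
        for rank, name in enumerate(ordered, start=1):
            rank_sums[name] += rank
    mean_ranks = [rank_sums[m] / n for m in names]
    # chi^2_F = 12n / (k(k+1)) * (sum of squared mean ranks - k(k+1)^2 / 4)
    return (12 * n / (k * (k + 1))) * (
        sum(r**2 for r in mean_ranks) - k * (k + 1) ** 2 / 4
    )

# Hypothetical accuracies of three RDF2vec variants on five datasets.
scores = {
    "classic":   [0.81, 0.78, 0.85, 0.80, 0.83],
    "p-RDF2vec": [0.79, 0.74, 0.82, 0.78, 0.80],
    "e-RDF2vec": [0.70, 0.69, 0.75, 0.72, 0.71],
}
chi2 = friedman_statistic(scores)
print(f"Friedman chi-square: {chi2:.2f}")  # compare against chi^2 with k-1 df

# One-sided binomial test, e.g. 70 correct predictions out of 100 (H0: p = 0.5):
print(f"binomial p-value: {binom_pvalue_one_sided(70, 100):.6f}")
```

If the omnibus Friedman test rejects, the follow-up pairwise comparisons could use paired t-tests with, e.g., a Bonferroni correction; in practice one would use the existing implementations `scipy.stats.friedmanchisquare` and `scipy.stats.ttest_rel` rather than a hand-rolled statistic.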

Overall, the paper reads well. However, there are some aspects that seem vague and/or inconsistent.
* Section 3.1: It is not clear if w0 is a distinguished node and a random walk spans in both directions from it, or if the elements of the vectors are indexed from -n/2 to n/2 just to confuse the reader. If the former, then it is inconsistent with Figure 1. If the latter, please don't.
* Eq. (3) represents a vector of length n/2 (since the indices are incremented by 2). Is it still a random walk of length n?
* Section 3.2: What the names CBOW and SG stand for should be given, since for a reader not intimately familiar with RDF2vec it is unclear what the original configurations were.
* You usually use the abbreviation DLCC which stands for Description Logic Class Constructors, but Section 5's title is DL Constructors. I think this is inconsistent, or you need to explain the difference in naming properly.
* Section 5/Eq. (9): There's a dot missing after \exists R_2^{-1}
* Section 5/Cardinality restrictions: "the corresponding decision problem is between the two variants" What do you mean?
* Section 5.1: This section is weirdly short and since it is the only subsection of Section 5, I would remove it. In my opinion Table 2 is very hard to read and rather pointless since the hypotheses without the experimental results are not that interesting.
* Section 6, introduction: What does it mean that a gold standard is "officially published"?
* Section 6.3: The notion of hard negatives requires an explicit definition, giving an example is not sufficient.
* I would make Tables 3-5 an appendix since they are quite large and not that important, but that's only a suggestion.
* Section 7.3/Figure 4: It is not clear to me what exactly is depicted in Figure 4. How do you compute complexity? Offer a formula or an algorithm.
* Table 6 (and later): What exactly is ACC? Accuracy?
* The penultimate paragraph of Section 7.3: "(...) most models are not actually learning the description logic constructor but instead are picking up cross-correlations very well." This is unclear to me. I understand the example, but I don't understand the generalization.

There seems to be no "Long-term stable URL for resources". The paper itself references multiple resources, both on GitHub and Zenodo. Based on the claims in the paper, it seems to me they should be sufficient to reproduce the results. However, for the final version of the paper, I would recommend creating a single resource containing all the necessary code and datasets, along with a README on how to reproduce the results presented in the paper.

Review #2
Anonymous submitted on 03/Mar/2023
Major Revision
Review Comment:

The authors report an extensive evaluation of twelve RDF2vec variants using standard benchmarks and a newly introduced benchmark for description logic constructors.

Using symbolic logic to evaluate the quality of knowledge graph embeddings is a very nice idea. This should be motivated and highlighted in the paper. The work extends three papers already published at ISWC in two respects: one is the formulation of theoretical hypotheses about the representational power of the different RDF2vec-based variants, which are tested with systematic benchmarks; the other is a full comparison of twelve RDF2vec variants and seven additional baseline models. With this evaluation, the authors conclude that “certain tailored RDF2vec variants can lead to improved performance on different downstream tasks, given the nature of the underlying problem, and that they, in particular, have a different behavior in modeling similarity and relatedness.” This conclusion is not strong at all and does not add much to our understanding of RDF2vec embeddings.

With regard to the writing, the language should be improved. A number of long sentences should be made shorter and clearer. Terms, e.g., p-RDF2vec, are sometimes set in mathematical format and sometimes not.

In the revised version, please make the motivation clear and natural, the conclusion strong, and the writing clean and simple.

Review #3
Anonymous submitted on 07/Mar/2023
Review Comment:

Summary of the paper:
In this paper, the authors evaluate the quality of twelve RDF2vec variants on the GEval and DLCC benchmarks, and compare them to seven knowledge graph embedding methods.

Detailed comments:
1) Originality: This is an extended work based on the previously published works [15,16,17]. In contrast to the previous works, this work focuses on the evaluation of RDF2vec variants, especially on whether they can learn different kinds of DL constructors. The main contributions stated by the authors are (1) an in-depth evaluation of the RDF2vec family and (2) an evaluation on novel tasks, i.e., the DLCC benchmark. Considering that some RDF2vec variants mentioned in the related work are not included in the evaluation, and that DLCC was already proposed in previous work, the originality and novelty of this paper are limited.

2) Quality of writing: The paper is easy to understand. Some paragraphs are exactly the same as paragraphs in [17], which is not expected to happen. Some of the statements are imprecise and should be rewritten. Two examples are:
- "We define multiple description logic (DL) constructors together with hypotheses and create two benchmarks...": the DL constructors are the same as in [17].
- "This is the first attempt to understand what knowledge graph embedding methods can actually represent": this is also the contribution of [17].

3) Other comments:
- In the related work section, it is said that [18] distinguishes five different techniques for KGE, while this categorization was proposed for graph embedding, which is a broader topic. Is it appropriate to use this categorization for KGE?
- MRR is also a common metric for the evaluation of KGE, but it is missing from the related work on knowledge graph embedding evaluation.
- SG and CBOW are first mentioned in the first line of Section 3.2 without explanation.
- It is not explained what the difference between a tick and a (tick) in Table 2 is.
- Figures 2 and 3 are the same as in [17].
- As a work targeted at evaluating RDF2vec variants, the paper needs a more in-depth analysis of the RDF2vec variants based on the experimental results.

In summary, my main concern about this paper is the limited contribution and novelty.