RDF2Vec: RDF Graph Embeddings and Their Applications

Tracking #: 1495-2707

Petar Ristoski
Jessica Rosati
Tommaso Di Noia
Renato De Leone
Heiko Paulheim

Responsible editor: 
Freddy Lecue

Submission type: 
Full Paper
Linked Open Data has been recognized as a valuable source for background information in many data mining and information retrieval tasks. However, most of the existing tools require features in propositional form, i.e., a vector of nominal or numerical features associated with an instance, while Linked Open Data sources are graphs by nature. In this paper, we present RDF2Vec, an approach that uses language modeling approaches for unsupervised feature extraction from sequences of words, and adapts them to RDF graphs. We generate sequences by leveraging local information from graph sub-structures, harvested by Weisfeiler-Lehman Subtree RDF Graph Kernels and graph walks, and learn latent numerical representations of entities in RDF graphs. We evaluate our approach on three different tasks: (i) standard machine-learning tasks (ii) entity and document modeling (iii) content-based recommender systems. The evaluation shows that the proposed entity embeddings outperform existing techniques, and that feature vector representations of general knowledge graphs such as DBpedia and Wikidata can be easily reused for different tasks.
Full PDF Version: 

Major Revision

Solicited Reviews:
Click to Expand/Collapse
Review #1
Anonymous submitted on 19/Dec/2016
Minor Revision
Review Comment:

In this manuscript, the authors propose a technique for learning latent numerical vector representations of entities in RDF graphs which is inspired by, and exploits the same ideas of, word embedding techniques like Google's Word2vec model.
This technique, called RDF2Vec, works in two phases. First of all, sequences of tokens are generated from an RDF graph; these sequences may thus be regarded as sentences constructed on the alphabet of IRIs, the identifiers of RDF entities (resources) and relations (properties). Once the RDF graph has thus been linearized, the same technique used by Word2vec may be applied; namely, a shallow two-layer neural network is trained to learn the context in which each "word" (i.e., RDF resource) appears.
For the first phase (RDF graph linearization), two "strategies" are proposed:
(i) random walks of a given length starting from a vertex, and
(ii) Weisfeiler-Lehman subtree graph kernels, adapted to suit the specificities of RDF graphs.
For the second phase (learning of the latent vector representation), two models are considered:
(i) the continuous bag-of-words (CBOW) model, which trains the neural network to predict a word given its context, and
(ii) the skip-gram (SG) model, which trains the neural network to predict the most likely context given a word.

The rest of the manuscript goes on to show, through an impressive set of experiments, that the embeddings produced by the RDF2Vec technique mey be used as representations of RDF resources (entities) in three data mining tasks, namely
(a) classification/predictive modeling (using the embeddings to enrich the features used by the classification method),
(b) clustering (computing the similarity between documents using information from the LOD as background knowledge), and
(c) information retrieval (recommender systems)
The empirical evidence compellingly suggests that using the RDF2Vec embeddings in these three data mining tasks improves the performance of state-of-the art methods.

The manuscript is an extension of two conference papers: "RDF2Vec", by two of the authors (Ristoski and Paulheim), presented at ISWC 2016, and "RDF Graph Embeddings for Content-Based Recommender Systems", by the same authors of this manuscript, presented at the CBRecSys 2016 workshop.

Overall, the manuscript is technically sound and well-organized. However, the presentation in Sections 3 to 5 should be improved and more details provided, for the sake of both clarity and repeatability.

Section 3

Definition 1 is not a specific definition of an RDF graph, but of a graph in general. No specific characteristic of an RDF graph is pointed out in the definition; however, in Sections 3.1.1 and 3.1.2, two specificities of RDF graphs are exploited, namely that arcs are directed (e.g., "outgoing edges") and that both vertices and arcs are labeled, even though the specific nature of labels is not really touched upon.
Therefore, I think the authors ought to define an RDF graph as a triple (V, E, \ell), where the arcs in E are directed and \ell is a labeling function from V union E to the IRIs (see also Shevarshidze et al., JMLR 12 (2011), [70]).

The last paragraph of Section 3.1.2 ("The procedure of converting the RDF graph...") is rather obscure. I think it would be a good idea to expand it and - why not? - to write out the algorithm's pseudo-code in a figure. Furthermore, it is not clear at all what the pattern of the sequences after the first iteration should be: what are T_1, T_d supposed to be? Numbers (subtree count), strings (but then, how constructed), or what?

Section 4

The last paragraph of this section gives the settings used for the empirical evaluation:
- DBpedia: all the walks with d = 2 and 500 walks with d = 4, 8;
- Wikidata: all the walks with d = 2 and 200 walks with d = 4;
Then: window size = 5, # iterations = 5, negative sampling, # samples = 25, etc.
How and why were these parameter settings determined?
How sensitive is the proposed approach to these settings?
I think these are rather crucial questions, which should be addressed by the authors.

Section 5

It is not completely clear how the items of the list under the paragraph starting with "We compare our approach to several baselines" relate with the row names in Tables 2ff.
Some of the abbreviations used there appear here between parentheses, but some do not and, as a consequence, the reader is (or, at least, I was) confused...
In particular, the first item, "Features derived from specific relations" (types, categories), are used in Tables 3 and 4, but they are not in Table 2. Why?
For WL, the author declare to use two pairs (?) of settings, d = 1, h = 2 and d = 2, h = 3, but then in Table 2 one finds WL_1_2 and WL_2_2 (instead of WL_2_3 which would be expected). Here, apart from the fact that two pairs of setting, for me, are 4 settings in total, stating explicitly to which label each setting corresponds would be more than helpful.

Then, in Section 5.2, the authors observe that the Wikidata SG 500 vectors outperform the other approaches. Do they have any explanation to offer as to why this happens?

Typos and Other Minor Points

On page 1, "Most algorithms work with a propositional feature vector...": I would rather say "Most *data mining* algorithms", which is what really the article is about.

Section 2.2.1, "word-distributed-based" -> "word-distribution-based".

Section 3.1.1, "in the vertex v" -> "in vertex v".

Section 3.2, "recent advancements in the field" -> "recent advances in the field".
"several efficient approaches has been proposed" -> ... have been proposed
"One of the most popular and widely used is..." -> "One of the most popular and widely used * approaches* is..."

Section 3.2.1, "comprised from all" -> "comprised of all"

Section 5.1, "The dataset contains of three..." -> "... contains three..."

Just before Section 6.1, "Google"and "Mark Zuckenberg are not similar at all, and have low relatedness value: I would rather say that they have a *somehow lower* relatedness value. After all, Google is a very successful Internet company and Mr Zuckerberg is the founder of a very successful Internet company (albeit, admittedly, not the same one)...

Section 7.1, "the strength of ith variable" -> "the strength of the ith variable".

Section 7.3.1, "For the sake of true," -> "For the sake of truth";
"the highest value of precision is gained" -> "... is achieved/obtained";
"films" -> "movies".

Section 7.3.2, "We remember that..." -> "We remind that..."

Section 8, "can be costly" -> "can be (computationally) expensive".

Review #2
By Jiewen Wu submitted on 29/Dec/2016
Minor Revision
Review Comment:

This paper presents an approach to encode RDF graphs into sequences, which are consequently used to train neural language models for different machine learning tasks. Overall, the paper is well written and presented, and it has provided a number of empirical evaluations on various tasks to show that the RDF2Vec approach can produce multi-purpose vector embeddings for these tasks.

Originality: the proposed approach uses unsupervised approach to generate sequences from RDF graphs, and seem to be able to capture some of the latent semantics, e.g., word order, word closeness, some semantic relationships, etc. It seems to be original from an application point of view.

Significance: the experimental results suggest that the feature vectors generated by the proposed approach outperform the baseline approaches and some state-of-the-art techniques in certain tasks.

Presentation: the paper has a good coverage of related work and is clearly presented. There are a few minor typos/errors, yet it is well written in general.

The approach itself is interesting and seems to be useful for various scenarios and applications. A variety of experiments were carried out and the experimental results seem to be convincing. It would be more useful to provide more insights on the results, e.g., why skip-gram models tent to outperform CBOW?

My main question is on the vectorization (and the use) of semantic entities. The intention of this paper is to use LOD as an information source where ML tasks can extract features. In other words, RDF graphs are treated more or less as (possibly cleaner) syntactic artefacts, given that the embeddings produced via the proposed approach eliminates most of the semantics that underly the surface textual sequences. As stated in the conclusions, the authors are aware of the loss of some semantics, e.g., no distinction between schema and data was made. The latent semantics captured by the vectors, as shown by PCA results (e.g., Figure 2, Figure 4), can be well presented using the original semantic relationships. As is known to us, PCA cannot derive the semantic relationship, rather it is up to humans to interpret what the reduced dimensions capture from the raw data. Does a vector capture all such semantic relationships? My point is: is the proposed vectorization of RDF graphs the best use for different ML tasks? This is in response to the claims made by the authors on the "universality" of the representation. Indeed, the generation of sequences can be task independent, but it is likely that different ML tasks will need to use RDF graphs differently, e.g., some may want to take advantage of the semantics (which is the core concept of semantic representations like RDF graphs).

Further comments:
1. Section 3, when converting triples into (graph walks) sequences, how is the order of different triples determined? e.g., for the same subject s, , , ..., it seems to me that the order makes a difference, depending on the tasks as the property p_1 may need to be closer to s compared to p_2 in one task, but the reverse is true in a different task.

2. Section 3.1.1, in the middle of the paragraph: what is the definition of e_{1i}? what does the digit 1 stand for? If E is a set, what is E(v_r)? Following this sentence, there is also a notation of v_{1i}, please clarify.

3. Section 3.2.1, ".. and a context window c", I believe you meant to say the window is of size c.

4. Section 5. When performing classification/regression, did the authors perform any feature selection on the given features? Were all feature used? What about multicollinearity? Were there any features that has a lot of missing values? How were missing values handled? It will be more helpful and convincing to mention what actions were taken to arrive at the presented accuracy or RMSE.

5. Section 5.3., line 6, it should read Figure 2 (not 4).

6. Section 5.4., (Figure 3 as well) There is no clarification on the term/strategy RDF2vec. What does it refer to?

Review #3
By Achim Rettinger submitted on 24/Feb/2017
Major Revision
Review Comment:

The paper is about sequentialization of RDF data to get high quality features for three different tasks: (i) clustering & regression (ii) document modeling and (iii) content-based recommender systems. The sequentialization is achieved by performing random walks or using a Weisfeiler-Lehmann graph kernel on a RDF graph to create a text-like representation. Afterwards word2vec is used on the created text to get vectors for the entities in the RDF graph. This approach is evaluated on many different datasets within the three different tasks.

We think the integration of knowledge from RDF data is a very important topic and the results are convincing but the description of the approach is lacking important details:

* There are different versions of the Weisfeiler-Lehmann graph kernel and therefore it is not clear how the text is created and how it looks like e.g. it is not clear whether the edges are used. Correspondingly the information why the edges are included or excluded would be needed. An illustration of how this kernel is applied on a very small example would help.

* For applying word2vec, it does not become clear whether the authors create one big text for the whole graph or multiple text files are generated. Is the sliding window going over the whole text or does it start and end with each walk sequence of a root node? To prevent the proximity of entities which do not share common characteristics, would speak for the latter but we couldn't find this information in the text.

* The parameter settings are not explained, e.g. why do the authors use 500 walks on DBpedia with depth 4 and 8 but only 200 on Wikidata with depth 4? Parameters like window size, number of iterations, negative samples, ... are mentioned but not evaluated nor discussed. Also, the rather important decision of the dimensionality is reduced to only 200 vs 500. Since the parameter settings for DBpedia and Wikidata are different, the significance of statements like "it clearly emerges that DBpedia lets to gain higher values than Wikidata for each metric involved"(p.20) are questionable.“

* Furthermore, the influence of the traversal strategy for Graph Walks is not investigated (see [1]).

Overall we would recommend a major revision for three reasons:

a) No other embedding approaches (TransE, Rescal, HolIE,..) are compared to the proposed approach.

b) There is no novelty in regard to the embedding approach compared to the authors’ ISWC publication.

c) There is limited additional insights from the new experiments, since an in depth discussion is missing and the experimental setup is sometimes not fully transparent (see above).

Below are further open points:

-- Missing explanation / reference --

p.6: "While some of the initially proposed approaches suffered from inefficient training of the neural network models, with the recent advancements in the field several efficient approaches has been proposed." Which initially proposed approaches? And which advancements? Negative sampling? word2vec only has one projection layer which shouldn't be affected by novel deep learning techniques.
p.7: "Empirical studies haven shown that in most cases negative sampling leads to a better performance than hierarchical softmax, which depends on the selected negative samples, but it has higher runtime." Reference?
p.7: "1,368 mapping-based properties" Mapping-based properties not explained.
p.8: "The dataset contains of three smaller-scale RDF datasets (i.e., AIFB, MUTAG, and BGS) [...] Details on the datasets can be found in [63]" Why did you leave out the AM dataset which was the fourth dataset described in [63]?
p.8: Illustrational example for the different features (rel in, rel-vals in, ...) would help to understand the baseline.
p.11: “Experiments that did not finish within ten days, or that have run out of memory" What was the configuration? How much memory?
p.14: The baseline approach is not clear and cannot reproduced without further details.
p.14: Is AnnOv an own approach? Reference? Difference to KORE and GBSS approach?
p.23: "they may also become an interesting technique for improving those sources (such as public knowledge graphs) as such" -> How?

-- Parameter Settings & Results --

p.7-8: number of iterations is different without explanation
p.8: "We use two pairs of settings, d = 1, h = 2 and d = 2, h = 3" and the same for "the length of the paths l" Missing explanation for choosing these (low) values?
p.11: Any explanation for the surprisingly low accuracy of SVM combined with DB2vec, especially in respect to the WD2vec combination?
p.15: Is the KORE - Video Games outlier the only outlier? In how many of the 21 entities do you outperform all other approaches?
p.19-20: F1-measure missing in the tables.
p.19: Are there any insights why e.g. DB2vec SG 500 4 outperforms WD2vec SG 500 4? Are there any structural differences or is it because of the coverage of Wikidata and DBpedia? Additional experiments with fixing the covered facts would be interesting.
p. 21: "We repeated the experiments three times for each datasets and then averaged the results across the three runs." Please add the standard deviation to the tables.

-- Notation & Style --

p.5: E(v_r) not introduced
p.5: v_r -> e_1i -> v_1i Shouldn't it be v_i instead of v_1i?
p.8: "c=" probably "classes" but not defined
p.9: "parameter C" not defined
p.14: equation between Eq. 7 and Eq. 8 is not labeled
p.17: Other URI references were in the footnotes.
p.19: equation is not labeled
p.19: Missing normalization term in the equation like "1/5" or "1/(|ratedItems(u)|)"?
p.20: Other URI references were in the footnotes.
The formatting style for tools, products, etc. varies, e.g. DBpedia is sometimes printed in verbatim.

-- Errata --

p.3: "[13,20,1]" -> "[1,13,20]"
p.3: "SalientSemantic Analysis (SSA) [18] use hyperlinks" -> "SalientSemantic Analysis (SSA) [18] uses hyperlinks"
p.3: "One approach that does not belong to these two main direction of research" -> "One approach that does not belong to these two main directions of research"
p.5: "after the second iteration will follow the following patter" -> "after the second iteration will follow the following pattern"
p.8: "proprty" -> "property"
p.13: "We follow similar approach" -> "We follow a similar approach"
p.14: " To obtain the final similarity judgments, the scores are averaged for each pair the scores of all annotators."
p.16: "So far, most of the proposed approaches in the literature are supervised or semi-supervised, which means [they] cannot work without human interaction. "
p.17: text in the margin

[1] Grover, Aditya, and Jure Leskovec. "node2vec: Scalable feature learning for networks." Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2016.