|Review Comment: |
In this manuscript, the authors propose a technique for learning latent numerical vector representations of entities in RDF graphs which is inspired by, and exploits the same ideas as, word embedding techniques such as Google's Word2vec model.
This technique, called RDF2Vec, works in two phases. First, sequences of tokens are generated from an RDF graph; these sequences may be regarded as sentences constructed over the alphabet of IRIs, the identifiers of RDF entities (resources) and relations (properties). Once the RDF graph has thus been linearized, the same technique used by Word2vec may be applied; namely, a shallow two-layer neural network is trained to learn the context in which each "word" (i.e., RDF resource) appears.
For the first phase (RDF graph linearization), two "strategies" are proposed:
(i) random walks of a given length starting from a vertex, and
(ii) Weisfeiler-Lehman subtree graph kernels, adapted to suit the specificities of RDF graphs.
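For readers unfamiliar with strategy (i), a minimal sketch of random-walk linearization is given below. This is my own illustration, not the authors' code: the adjacency-map representation, the function name, and the toy DBpedia-style IRIs are all assumptions made for the example.

```python
import random

def random_walks(graph, start, depth, num_walks, seed=0):
    """Generate token sequences by random walks over a directed,
    edge-labeled graph (strategy (i) above).

    graph: dict mapping a vertex IRI to a list of
           (property IRI, object IRI) outgoing edges.
    Each walk is a "sentence": start, p1, o1, p2, o2, ...
    """
    rng = random.Random(seed)
    walks = []
    for _ in range(num_walks):
        walk, current = [start], start
        for _ in range(depth):
            edges = graph.get(current, [])
            if not edges:
                break  # dead end: no outgoing edges
            prop, obj = rng.choice(edges)
            walk += [prop, obj]
            current = obj
        walks.append(walk)
    return walks

# Tiny toy graph with DBpedia-style IRIs (illustrative only).
g = {
    "dbr:Berlin": [("dbo:country", "dbr:Germany")],
    "dbr:Germany": [("dbo:capital", "dbr:Berlin")],
}
print(random_walks(g, "dbr:Berlin", depth=2, num_walks=1))
```

Each walk alternates resource and property IRIs, which is what makes the sequences usable as "sentences" in the second phase.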
For the second phase (learning of the latent vector representation), two models are considered:
(i) the continuous bag-of-words (CBOW) model, which trains the neural network to predict a word given its context, and
(ii) the skip-gram (SG) model, which trains the neural network to predict the most likely context given a word.
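The difference between the two models can be made concrete by looking at the training examples each extracts from a linearized sentence. The sketch below is mine (illustrative names, not the authors' code): skip-gram produces (center, context-word) pairs, while CBOW produces (context, center) examples.

```python
def skipgram_pairs(sentence, window):
    """Skip-gram: each (center word, one context word) is a training pair;
    the network predicts the context word from the center word."""
    pairs = []
    for i, center in enumerate(sentence):
        for j in range(max(0, i - window), min(len(sentence), i + window + 1)):
            if j != i:
                pairs.append((center, sentence[j]))
    return pairs

def cbow_pairs(sentence, window):
    """CBOW: each (full context, center word) is a training example;
    the network predicts the center word from its (averaged) context."""
    examples = []
    for i, center in enumerate(sentence):
        context = [sentence[j]
                   for j in range(max(0, i - window), min(len(sentence), i + window + 1))
                   if j != i]
        examples.append((context, center))
    return examples

sent = ["dbr:Berlin", "dbo:country", "dbr:Germany"]
print(skipgram_pairs(sent, window=1))
print(cbow_pairs(sent, window=1))
```

In the RDF2Vec setting the "words" are IRIs from the walks, so both models end up embedding resources and properties in the same vector space.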
The rest of the manuscript goes on to show, through an impressive set of experiments, that the embeddings produced by the RDF2Vec technique may be used as representations of RDF resources (entities) in three data mining tasks, namely
(a) classification/predictive modeling (using the embeddings to enrich the features used by the classification method),
(b) clustering (computing the similarity between documents using information from the LOD as background knowledge), and
(c) information retrieval (recommender systems).
The empirical evidence compellingly suggests that using the RDF2Vec embeddings in these three data mining tasks improves the performance of state-of-the-art methods.
The manuscript is an extension of two conference papers: "RDF2Vec", by two of the authors (Ristoski and Paulheim), presented at ISWC 2016, and "RDF Graph Embeddings for Content-Based Recommender Systems", by the same authors of this manuscript, presented at the CBRecSys 2016 workshop.
Overall, the manuscript is technically sound and well-organized. However, the presentation in Sections 3 to 5 should be improved and more details provided, for the sake of both clarity and repeatability.
Definition 1 is not a specific definition of an RDF graph, but of a graph in general. No specific characteristic of an RDF graph is pointed out in the definition; however, in Sections 3.1.1 and 3.1.2, two specificities of RDF graphs are exploited, namely that arcs are directed (e.g., "outgoing edges") and that both vertices and arcs are labeled, even though the specific nature of labels is not really touched upon.
Therefore, I think the authors ought to define an RDF graph as a triple (V, E, \ell), where the arcs in E are directed and \ell is a labeling function from V union E to the IRIs (see also Shervashidze et al., JMLR 12 (2011)).
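Written out, the suggested definition might read as follows (the symbol \mathcal{I} for the set of IRIs is my notation, not the manuscript's):

```latex
% Suggested definition: an RDF graph as a labeled directed graph.
% V: vertices (resources); E: directed arcs; \mathcal{I}: the set of IRIs.
G = (V, E, \ell), \qquad E \subseteq V \times V \ \text{(directed)}, \qquad
\ell \colon V \cup E \to \mathcal{I}
```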
The last paragraph of Section 3.1.2 ("The procedure of converting the RDF graph...") is rather obscure. I think it would be a good idea to expand it and - why not? - to write out the algorithm's pseudo-code in a figure. Furthermore, it is not clear at all what the pattern of the sequences after the first iteration should be: what are T_1, T_d supposed to be? Numbers (subtree count), strings (but then, how constructed), or what?
The last paragraph of this section gives the settings used for the empirical evaluation:
- DBpedia: all the walks with d = 2 and 500 walks with d = 4, 8;
- Wikidata: all the walks with d = 2 and 200 walks with d = 4;
Then: window size = 5, # iterations = 5, negative sampling, # samples = 25, etc.
How and why were these parameter settings determined?
How sensitive is the proposed approach to these settings?
I think these are rather crucial questions, which should be addressed by the authors.
It is not completely clear how the items of the list under the paragraph starting with "We compare our approach to several baselines" relate to the row names in Tables 2ff.
Some of the abbreviations used there appear here between parentheses, but some do not and, as a consequence, the reader is (or, at least, I was) confused...
In particular, the features of the first item, "Features derived from specific relations" (types, categories), are used in Tables 3 and 4, but not in Table 2. Why?
For WL, the authors state that they use two pairs of settings, d = 1, h = 2 and d = 2, h = 3, but then in Table 2 one finds WL_1_2 and WL_2_2 (instead of the expected WL_2_3). Here, apart from the fact that two pairs of settings are, strictly speaking, four settings in total, stating explicitly which label corresponds to which setting would be more than helpful.
Then, in Section 5.2, the authors observe that the Wikidata SG 500 vectors outperform the other approaches. Do they have any explanation to offer as to why this happens?
Typos and Other Minor Points
On page 1, "Most algorithms work with a propositional feature vector...": I would rather say "Most *data mining* algorithms", which is what really the article is about.
Section 2.2.1, "word-distributed-based" -> "word-distribution-based".
Section 3.1.1, "in the vertex v" -> "in vertex v".
Section 3.2, "recent advancements in the field" -> "recent advances in the field".
"several efficient approaches has been proposed" -> ... have been proposed
"One of the most popular and widely used is..." -> "One of the most popular and widely used * approaches* is..."
Section 3.2.1, "comprised from all" -> "comprised of all"
Section 5.1, "The dataset contains of three..." -> "... contains three..."
Just before Section 6.1, the statement that "Google" and "Mark Zuckerberg" are not similar at all, and have low relatedness value: I would rather say that they have a *somewhat lower* relatedness value. After all, Google is a very successful Internet company and Mr Zuckerberg is the founder of a very successful Internet company (albeit, admittedly, not the same one)...
Section 7.1, "the strength of ith variable" -> "the strength of the ith variable".
Section 7.3.1, "For the sake of true," -> "For the sake of truth";
"the highest value of precision is gained" -> "... is achieved/obtained";
"films" -> "movies".
Section 7.3.2, "We remember that..." -> "We recall that..."
Section 8, "can be costly" -> "can be (computationally) expensive".