Vecsigrafo: Corpus-based Word-Concept Embeddings - Bridging the Statistic/Symbolic Representational Gap

Tracking #: 1864-3077

José Manuel Gómez-Pérez
Ronald Denaux

Responsible editor: 
Guest Editors Semantic Deep Learning 2018

Submission type: 
Full Paper
The proliferation of knowledge graphs and recent advances in Artificial Intelligence have raised great expectations related to the combination of symbolic and distributional semantics in cognitive tasks. This is particularly the case of knowledge-based approaches to natural language processing as near-human symbolic understanding and explanation rely on expressive structured knowledge representations that tend to be labor-intensive, brittle and biased. This paper reports research addressing such limitations by capturing as embeddings in a joint space both words and concepts from large document corpora. We compare the quality of the resulting embeddings and show that they outperform word-only embeddings for a given corpus.
Major Revision

Solicited Reviews:
Review #1
Anonymous submitted on 28/Apr/2018
Major Revision
Review Comment:

This paper proposes a method to learn both knowledge-based concepts and words in the same vector space. The paper provides an extensive evaluation on different benchmarks, comparing with other knowledge-based and word-embedding approaches. The results show the potential of the proposed approach. In general, the paper is quite interesting and thorough: it deals with an important problem, proposes a quite flexible solution (word and concept embeddings in the same space), and shows the advantages of the proposed model over existing solutions. However, there are some presentation, methodological, and evaluation issues that prevent me from accepting it as it is. My main comments/concerns are listed below.

Section 3, explaining the methodology, is extremely short and lacks many details. Since it represents the core of the paper, it should definitely be extended in order to be self-contained. In particular, since the method seems heavily based on the Swivel algorithm, this algorithm should be further explained. Also, some equations are mentioned but not written, which makes the section very difficult to follow for readers not familiar with the previous work.

The method is not completely novel (the authors mention SW2V as a very similar method consisting of similar steps). However, the evaluation shows that this method may indeed have some advantages with respect to that similar method (but please see my concerns about the evaluation below, which are related to this).

The evaluation provides and describes many benchmarks and results (which I appreciate; it is always welcome), but it is in general a bit messy. It does not get to the point, and it is difficult to interpret the results. One clear improvement would be to compare the methods using the same corpus and lexical resource (at least the similar ones) in order to better appreciate the advantages/disadvantages of the proposed method. This is not always the case in the evaluation, and even though many benchmarks are proposed, it is not clear what causes the improvements of the proposed method. Is it thanks to the disambiguation algorithm? (By the way, this disambiguation algorithm should also be properly described. Why not use a state-of-the-art disambiguation system instead?) Is it thanks to the modified co-occurrence matrix construction? Is it thanks to the use of Sensigrafo? Or is it thanks to the combination of all of them? In my opinion, these should be some of the main questions to be answered in the evaluation, but in the current form the answer is not clear (an ablation test could also help). Since, as I understood, the knowledge graph (Sensigrafo) and the disambiguation algorithm are proprietary, maybe some experiments should be carried out using other networks/disambiguation algorithms (or using other models on Sensigrafo), as in the experiment of Section 4.5.

Minor comments:
On page 7, Section 4.2, the method to compute similarity is described. This strategy is commonly used in the literature (see, e.g., Resnik, 1995, which is, to the best of my knowledge, the first article in which this strategy was introduced).
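For reference, the strategy in question (scoring a word pair by the maximum similarity over all pairs of their associated concepts) can be sketched as follows; all names and data here are illustrative, not taken from the paper:

```python
import numpy as np

def cosine(u, v):
    # Cosine similarity between two embedding vectors.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def word_similarity(word_a, word_b, word_to_concepts, concept_vecs):
    # Resnik-style strategy: score a word pair by the maximum similarity
    # over all pairs of concepts associated with the two words.
    sims = [cosine(concept_vecs[ca], concept_vecs[cb])
            for ca in word_to_concepts[word_a]
            for cb in word_to_concepts[word_b]]
    return max(sims) if sims else 0.0
```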

In the evaluation, it is mentioned that the NASARI embeddings do not seem to perform well on these benchmarks, but the authors do not have an answer. About this, it is important to mention what vocabulary is used for this model, as concepts need to be linked to words somehow (is the same method mentioned in my previous comment used?). Since it also covers Wikipedia, is it possible that the high coverage is damaging its performance? Is the whole Wikipedia/BabelNet vocabulary taken into account?

I understood the prediction task to be the same as the hypernym task. However, there seems to be some confusion, as these two terms are used interchangeably. I would suggest either explaining clearly that they are the same or (better, in my opinion) being consistent about the nomenclature. About this task: why not use a standard dataset instead of creating a new one? For example, from SemEval tasks or other hypernymy prediction datasets. Also, in this experiment it is mentioned that a validation set is used. What is it used for? What parameters were tuned on this validation set?

It is mentioned in several parts of the evaluation that word similarity may not be ideal because it provides only one metric. In this paper only Spearman correlation is reported, although Pearson has also been used for this task. In any case, I agree with the general idea that even though similarity provides a good starting point for the intrinsic evaluation of distributed representations, it may not be enough to assess the capabilities of these models on its own.
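For completeness, the two correlation measures are easy to report side by side; a minimal pure-Python sketch (with made-up gold and model scores, and no handling of rank ties):

```python
import math

def pearson(xs, ys):
    # Pearson correlation: degree of linear agreement between two score lists.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def spearman(xs, ys):
    # Spearman correlation: Pearson computed over ranks (ties not handled).
    def ranks(vs):
        order = sorted(range(len(vs)), key=lambda i: vs[i])
        r = [0.0] * len(vs)
        for rank, i in enumerate(order):
            r[i] = float(rank)
        return r
    return pearson(ranks(xs), ranks(ys))

gold = [9.0, 7.5, 3.2, 1.1]      # hypothetical human similarity ratings
pred = [0.81, 0.77, 0.30, 0.05]  # hypothetical model cosine similarities
```

On perfectly monotonic data like the above, Spearman is exactly 1.0 even when the relationship is not perfectly linear, which is why the two coefficients can diverge on real benchmarks.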

The paper is well-written in general, although it contains a few template and presentation issues:

Page 1: “bottleneck[2]” -> “bottleneck [2]”
Page 5: The format, and the line spacing in particular, differs between the two columns (this issue is also present on other pages).
Page 6: “and and” at the end of the first column
In the Evaluation section, when presenting the tasks, it would be helpful for the reader to link them to the specific sections. Also, in general, as mentioned earlier, the presentation in the evaluation should be changed so it is clearer what is being evaluated and why, in order to reach better general conclusions.
There are some words that overflow the first column (e.g., page 10, “scraped”).
Page 12: reference missing after “simply memorising words associated to positive pairs”.
Page 13, Figure 4: Why use a figure to display the results in this case? It would be better to show them in a table, as with the other evaluations.

Review #2
Anonymous submitted on 30/May/2018
Major Revision
Review Comment:

The paper presents extensive work on training different word embeddings over a corpus (or corpora) annotated with senses. The embeddings thus relate to words (in this case, lemmas) and their senses. The resulting embeddings are called Vecsigrafo.

The processing is as follows:
- a knowledge graph is selected. Generally, a knowledge graph is a set of concepts with a relational structure and a lexicon aligned to it; the lexicon may also contain multiword expressions.
- a corpus: a set of texts.
- a processing pipeline for semantic annotation of the corpus with concepts from the knowledge graph.
- the authors use the Swivel system for training Vecsigrafo. It is applied in two steps: Swivel prep for producing the co-occurrence matrix, and then Swivel for creating the embeddings in Vecsigrafo.
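As I understand it, the prep step amounts to building a sliding-window co-occurrence matrix over the (annotated) token stream, which the second step then factorizes into embeddings. A rough sketch of the counting step (my own simplification, not the authors' code):

```python
from collections import Counter, defaultdict

def cooccurrence_counts(tokens, window=5):
    # Symmetric sliding-window co-occurrence counts over a token stream;
    # this is the matrix that the embedding-training step factorizes.
    counts = defaultdict(Counter)
    for i, focus in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                counts[focus][tokens[j]] += 1
    return counts
```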

Two knowledge graphs are used in the experiments:
- a proprietary semantic network, Sensigrafo (similar to WordNet), with about 400K lemmas and 300K concepts.
- BabelNet 3.0

Processing is done with Cogito Studio 14.2.

After the annotation of the corpus (or corpora), a filtering step is applied in order to delete punctuation, functional words, etc.

Besides Swivel, other systems for creating embeddings, such as GloVe, FastText, and HolE, are also used.

The rest of the paper evaluates the results on different tasks and discusses the outcomes of the evaluation.

Obviously a lot of experiments are performed, but the main idea behind this work is poorly presented in the text.

The idea of using semantically annotated corpus is not new.

The new part is the use of a different processing pipeline, which is not described in the paper. For example:
- what is the performance of the processing pipeline over the different corpora used in the evaluation? 60% or 90%?
- how is tokenization different from other approaches?
- what is the performance of the filtering step?
- how is the semantically annotated corpus represented to the word-embedding systems?
As lemma_1 sense_1 lemma_2 sense_2 ..., or in a different way?
- "For all the embeddings, we have used a minimum frequency of 5 and a window size of 5 words around the target word." This has to hold for lemmas and for concepts, but the frequency could differ between lemmas and concepts. Moreover, the window depends very much on the representation of the annotated corpora.
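To illustrate the representation question: if the annotated corpus is serialized as interleaved lemma/sense tokens, a fixed token window covers roughly half as many lemmas as it would over a plain-lemma stream, so the effective context shrinks. A toy illustration (my own, not the authors' format; the `s#` sense labels are invented):

```python
def interleave(lemmas, senses):
    # Serialize an annotated sentence as: lemma_1 sense_1 lemma_2 sense_2 ...
    # Unannotated tokens (sense is None) contribute only their lemma.
    out = []
    for lemma, sense in zip(lemmas, senses):
        out.append(lemma)
        if sense is not None:
            out.append(sense)
    return out

lemmas = ["cat", "chase", "mouse", "in", "garden"]
senses = ["s#cat", "s#chase", "s#mouse", None, "s#garden"]
tokens = interleave(lemmas, senses)
# A window of 2 tokens on each side of 'chase' now spans
# ['cat', 's#cat', 's#chase', 'mouse']: only two neighbouring lemmas,
# where the same window over the plain lemma stream would see three.
```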

From all this, it is not clear in the paper how the embeddings are created, and the new contribution is not clearly described. Even the use of Sensigrafo and BabelNet as two knowledge graphs is unclear: how does the choice of knowledge graph impact the embeddings? Is the performance the same for both?

On page 6, second column, a "Sensigrafo corpus" is mentioned in the bullet for HolE. It is not clear which corpus this is.

In the evaluation part, of course, mainly word embeddings are compared. It is not always clear what is being measured: word-form or lemma embeddings. This is important because the different similarity datasets (for example) use word forms. If the corresponding pairs are lemmatized, the meaning could change and thus the human judgment could be wrong. For example, "media/radio" in WordSim353 becomes "medium/radio" after lemmatization, which could be understood differently than the original pair.

In my view, for the paper to be useful to the community, it needs major rewriting. It has to describe the main contribution to the creation of embeddings more clearly. In the evaluation part, it is then necessary to separate the evaluation of Vecsigrafo itself, with respect to its different parameters, from comparisons with other word-embedding systems over the same corpora.

Review #3
Anonymous submitted on 06/Jun/2018
Major Revision
Review Comment:

The paper proposes an approach, called Vecsigrafo, for generating a single, unified representation for NLP tasks based on joint word-concept embeddings. Such a unified vectorial representation, generated from large corpora, shares the vocabulary of a given Knowledge Graph. The authors run different learning algorithms over a selection of corpora (of different sizes) and evaluate the results on a variety of tasks, such as semantic similarity, relatedness, and word-concept and hypernym prediction.

The provided results show improved performance of this type of embeddings with respect to word-only and knowledge-graph embeddings for medium-size training corpora. Such an effect is not reported for larger corpora.

In general, the overall approach proposed in the paper is methodologically sound, and the provided results (based on standard datasets for each of the mentioned tasks) are quite interesting.

There are, however, some problematic parts that should, in my opinion, be properly addressed/clarified by the authors, since they are a source of confusion.

First: the authors make some strong claims when they say, for example (page 1) “powerful rule-based systems [1] failed because their reasoning formalism was focused on the operational aspects of inference”. What does it mean that they “failed”? I do not think this terminology is fair since rule-based systems are widely used in symbolic AI and, of course, in hybrid AI approaches.

In addition: the authors say that their proposal aims to work “at the knowledge level”. Unfortunately, they show a deep misunderstanding of what Allen Newell (whom they quote) intends by “knowledge level”. This level, in Newell's sense, is intended as a level of analysis for understanding and predicting the rational behavior of a cognitive artificial agent simply on the basis of the content of its available representations, its knowledge, its goals, etc.
What the authors actually propose in this paper, on the other hand, is a possible solution that focuses more on the “representational” level. In the Newell hierarchy, this means that they focus on the “Symbol Level”. However, this kind of account should also be specified and detailed, since the type of joint representation they propose is not symbolic at all (while in Newell's theoretical framework, i.e. the Physical Symbol System Hypothesis, the only type of representation assumed to “exist” is symbolic).

Another element that is quite confusing is the description given of distributional semantics resources. The authors say that such resources usually have “low dimensional spaces”. However, the exact contrary is true: distributional resources have hundreds of dimensions. There are, in fact, some methods that have been proposed to reduce the dimensionality of such resources by integrating resources such as BabelNet, NASARI and ConceptNet via WordNet (please see Lieto, Mensa and Radicioni, 2016 on this point).

The above-mentioned paper can also be interesting, in my view, since it provides an alternative solution to the proposed joint word-concept vectorial representations.
In particular, it proposes an algorithmic procedure to generate a low-dimension, concept-level semantic resource (based on the ConceptNet properties) built on top of linguistic resources (equipped with WordNet and BabelNet synset IDs). The resulting resource (based on concept embeddings) is called COVER and has been evaluated on some of the SemEval-17 tasks mentioned in the paper. A brief comparison of these approaches (from both a methodological and an applicative point of view, i.e., in the latter case, by reporting the scores of such resources on the SemEval-17 tasks) could be helpful.

Other concerns are about Section 3. In my opinion, it should be more self-contained. For example, the authors very briefly refer to the “Swivel algorithm” (also mentioned in previous parts of the paper) and to a modification of this algorithm that they adopted to learn the embeddings from a vocabulary. I would suggest that the authors extend this part with a clear example showing what their actual modification of the Swivel algorithm consists of.

Finally, in the discussion part, I would expect the authors to come up with a stronger explanation of why the proposed joint word-concept embeddings have no effect on large corpora, since the proposed conclusion is not very convincing: if the effect holds for small- and medium-size corpora, I do not see why it should not hold for larger corpora. I feel that there is something missing here (or something that deserves a more detailed explanation).

Minor notes:

There is a missing reference on page 12: “positive pairs [?]”.

The results they provide for NASARI are very unexpected. The authors report different results with respect to those documented for this resource (also for the nouns). It could be a good idea to contact the NASARI developers to ask about the reasons for these non-reproducible results.