Review Comment:
The paper proposes an approach, called Vecsigrafo, for the generation of a single, unified, representation for NLP task based on joint word-concepts embeddings. Such unified vectorial representation, generated from large corpora, shares the same vocabulary of a given Knowledge Graph and the authors provide an evaluation running different learning algorithms over a selection of corpora (of different sizes) and evaluate their results over a variety of tasks, such as semantic similarity, relatedness and word-concept and hypernym prediction.
The provided results show an improved performance of their type of embeddings with respect to word-only and knowledge graphs embeddings with medium size training corpora. Such effect is not reported for larger corpora.
In general, the overall approach proposed in the paper is methodologically sound and the provide results (based on standard dataset for each of the mentioned tasks) are quite interesting.
There are, however, some problematic parts that, in my opinion, should be properly addressed/clarified by the authors since they are, in my opinion, a source of confusion.
First: the authors make some strong claims when they say, for example (page 1) “powerful rule-based systems [1] failed because their reasoning formalism was focused on the operational aspects of inference”. What does it mean that they “failed”? I do not think this terminology is fair since rule-based systems are widely used in symbolic AI and, of course, in hybrid AI approaches.
In addition: the authors say that their proposal aims to work “at the knowledge level”. Unfortunately, they show a deep misunderstanding of what Allen Newell (that they quote) intends for “knowledge level”. This level, in the Newell sense, is intended as a level of analysis to understand and predict the rational behavior of a cognitive artificial agent simply on the basis of the analysis of the content of its available representations, its limited knowledge of its goals etc.).
On the other hand, what the authors actually propose in this paper is a possible solution that focuses more on the “representational” level. In the Newell hierarchy, this means that they focus at the “Symbol Level”. However, also this kind of account should be really specified and detailed since the type of joint-representation that they propose is not symbolic at all (while in Newell theoretical framework, i.e. the Physical Symbol System Hypothesis, the only type of representation assumed to “exist” is symbolic).
Another element that is quite confusing is the description provided by distributional semantics resources. They say that such resources are usually of “low dimensional spaces”. However, the exact contrary is true: distributional resources have hundreds of dimensions. There are, in fact, some methods that have been proposed to reduce the dimensionality of such resources by integrating resources such as Babelnet, NASARI and ConceptNet via Wordnet (please see Lieto, Mensa Radicioni, 2016 https://link.springer.com/chapter/10.1007/978-3-319-49130-1_32) on this point.
The above-mentioned paper can be also interesting, in my view, since it provides an alternative solution to the proposed joint-word-concept vectorial representations.
In particulate, it proposes an algorithmic procedure to generate a low dimension concept-level semantic resource (based on the ConceptNet properties) built on the top of linguistic resources (equipped with Wordnet and BabeNet Synset Ids). The obtained resource (based on concept-embeddings) obtained in this way is called COVER and has been evaluated is some of the SEMEVAL17 tasks mentioned in the paper: https://ls.di.unito.it/resources/cover/. A brief comparison of such approaches (from both a methodological and an applicative point of view, i.e. in the latter case by reporting the scores of such resources in the SEMEVAL17 tasks ) could be helpful.
Other concerns are for the section 3. In my opinion, it should be more self-contained. For example: the authors very briefly refer to the “Swivel algorithm” (also mentioned in previous parts of the paper) and to a modification of this algorithm that they adopted to learn the embeddings from a vocabulary. I would suggest the authors to extend this part by providing a clear example showing in what consists the actual modification of the Swivel algorithm that they provided.
Finally, in the discussion part, I would expect that the authors can come with a stronger explanation about why the proposed joint word-concept embeddings do not have effect in large corpora since the proposed conclusion is not very convincing: it the effect holds for small-medium size corpora I do not see why it should not hold for larger corpora. I feel that there is something missing here (or something that deserves a more detailed explanation).
Minor notes:
there is a missing reference on page 12 —> positive pairs[? ].
The results that they provided for NASARI is very unexpected. The authors report different results with respect to those documented for this resource (also for the nouns). It could be a good idea to contact to contact the NASARI developers in order to ask the reasons for this non-reproducibility results.
|