Distributional and Neural Models for Extracting Manipulation-Relevant Relations from Text Corpora

Tracking #: 1542-2754

Authors: 
Soufian Jebbara
Valerio Basile
Elena Cabrio
Philipp Cimiano

Responsible editor: 
Guest Editors ML4KBG 2016

Submission type: 
Full Paper
Abstract: 
In this paper we present a novel approach based on neural network techniques to extract common sense knowledge from text corpora. We apply this approach to extract of common sense knowledge about everyday objects that can be used by intelligent machines, e.g. robots, to support planning of tasks that involve object manipulation. The knowledge we extract is constituted by relations that relate some object (type) to some other entity such as its typical location or typical use. Our approach builds on the paradigm of distributional semantics and frames the task of extracting such relations as a ranking problem. Given a certain object, the goal is to extract a ranked list of locations or uses ranked by `prototypicality'. This ranking is computed via a similarity score in the distributional space. We compare different techniques for constructing this semantic distributional space. On the one hand, we use the well known SkipGram model to embed words into a low-dimensional distributional space, using cosine similarity to rank the various candidates. We also consider an approach in the spirit of latent semantic indexing that relies on the NASARI approach to compute low-dimensional representations that are also ranked by cosine similarity. While both methods were already proposed in earlier work, as main contribution in this paper we present a neural network approach in which the ranking or scoring function is directly learned using a supervised approach. We compare all these approaches showing superiority of the neural network approach for some evaluation measures compared to the other two approaches described above for the construction of the distributional space.
Tags: 
Reviewed

Decision/Status: 
Major Revision

Solicited Reviews:
Review #1
By Ziqi Zhang submitted on 04/Feb/2017
Suggestion:
Major Revision
Review Comment:

This paper introduces a method for learning common sense relations from text. Unlike traditional relation extraction/classification, I think the task described here has its ‘quirkiness’ and could be interesting in some practical situations. The writing is generally good, despite a couple of places where some clarification is needed. However, my main concerns are: 1) I have some reservations about the usefulness of the type of relations targeted in this work. Although I am inclined to be positive, I think the authors should better define, contrast, and motivate the task to make it more convincing for readers; 2) the method is only tested on two relations, which makes it difficult to assess the genericity of the approach; 3) the novel part of the work does not show an obvious strength over the previous work according to the experiments.

In my opinion there is a lot of work to do, but if it can be done and positive results are obtained, there is still good value for the relevant research. Therefore I decided to give the authors the opportunity to consider a revision. Details of my comments are below.

Task definition and motivation:
Since this work addresses ‘relation extraction’ of some sort, it should better compare and contrast with the literature in this area to show what is different and why. It was not clear to me, until I saw the examples in the experiments, how the extraction of ‘common sense knowledge’ really differs from classic IE tasks. I think the authors should restructure their argument along these lines: 1) to enable robotic intelligence, robots must be equipped with common sense knowledge; 2) traditional IE could be ineffective for common sense knowledge (this is the key), because such knowledge is often not explicitly encoded in text. Use examples: food is for eating, water is for drinking, a cooking pan is located in the kitchen, etc. Using the last example, it will be far more common to find sentences explicitly describing the location of a cooking pan at a fine-grained level (e.g., on the stove, on the table, inside the cupboard), while mentions of more general locations (common sense) may be rare. As a result, classic IE techniques can fail; 3) what kinds of tasks such knowledge (esp. the knowledge base you are aiming to construct) would enable robots to perform – this is unclear in the current paper. Although you mention finding things, I think it is not convincing enough (e.g., a robot knows to look for a cooking pan in the kitchen, but is that information good enough? If so why, if not why is this so important?). Consider telling a convincing story.
Related work should then take the above arguments into account and provide a more detailed comparison to support them. Specifically, why can traditional supervised relation extraction techniques not address your problem? What are the strengths and weaknesses of your approach compared to them?

Genericity of the proposed method:
I have doubts about how generalizable the method is. For the unsupervised methods, as the authors point out, they work when the domain and range of the relations are ‘sufficiently narrow’. The supervised method should be more generalizable because of learning. However, this is unclear from the experiments (all proposed models appear equally generalizable to the other relation), which I believe is because the scope of the experiments is too limited. Genericity may well be the strength of the supervised method, but much more thorough experimentation must be designed and performed. The authors should also follow well-argued criteria in selecting relations for the experiments: why locatedAt and usedFor? Did you rank them based on importance, frequency of usage, etc.? Can you test your method on more relations selected based on these criteria?
Another related issue is the cold-start problem. From the way the method is explained, it seems that you always depend on the availability of a pre-compiled list of ‘entity-entity’ pairs that may potentially form a certain relation. Your method then ‘confirms’ such a relation; or, with the supervised method, it selects one-vs-all. But do these pairs need to come from an existing KB? What if you do not have one? Will it still work for pairs extracted by classic IE methods, e.g., NER? Consider, for example, running NER to extract a set of entity-entity pairs from a free text corpus and then using your model to predict relations between them (a sketch of such a pipeline is given below). This would be more convincing because you would show that your method works both for classic IE problems, where information is fine-grained and explicit in text, and for common sense knowledge, where information is at a more general level and may be implicit.
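To make this concrete, a minimal sketch of such a cold-start pipeline could look as follows (assuming, e.g., spaCy for the extraction step; the tool choice, helper names, and example text are hypothetical, not anything described in the paper):

```python
import itertools
import spacy

nlp = spacy.load("en_core_web_sm")  # assumes the small English model is installed

def candidate_pairs(text):
    """Pair up co-occurring noun phrases per sentence as candidate entity-entity pairs."""
    pairs = set()
    for sent in nlp(text).sents:
        heads = sorted({chunk.root.lemma_.lower() for chunk in sent.noun_chunks})
        pairs.update(itertools.combinations(heads, 2))
    return pairs

text = "The cooking pan sits on the stove in the kitchen. A towel hangs in the bathroom."
for pair in candidate_pairs(text):
    print(pair)  # each candidate pair would then be scored by the proposed models
```

The point is simply that the input pairs would come from raw text rather than from a pre-compiled knowledge base.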

Novelty of this work:
The main addition to the previous work [5] is the supervised model. However, the experiments do not show particular advantages of this method; hence the value of this work (compared to [5]) is questionable.
You also make contradictory claims about this. On page 3 you say ‘showing that the supervised approach clearly outperforms the approaches based on semantic relatedness for some evaluation measures’; but this is not true according to the experiments. Indeed, in the conclusion you then say ‘We have shown that the improvements of the supervised model are not always clear compared to the two unsupervised approaches.’
But, as mentioned above, this could be due to the limited scope of the experiments, which does not help to truly reveal the strength of the supervised model. If you can test your method on more relations and include a few experiments that align the task with the classic relation extraction setting, it may become clearer.

Other issues:
Experiment baseline: the first baseline is too weak and I do not think it makes much sense. Considering that your unsupervised methods are based on semantic relatedness, I think you should include a baseline that uses a few state-of-the-art semantic relatedness measures (see the sketch below for an example of such a lightweight baseline). There are a lot of studies in this area; many use WordNet, Wikipedia, or distributional semantics. Google for ‘semantic relatedness measures’ to start with. After all, if a lightweight, simpler semantic relatedness measure can do as well as your unsupervised model, or even better, what is the value of your unsupervised methods?
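For instance, a lightweight baseline of the kind I have in mind could rank candidate locations for an object with an off-the-shelf WordNet measure; the following sketch (using NLTK's Wu-Palmer similarity, purely illustrative and not tuned) is the sort of thing the proposed models should comfortably beat:

```python
from nltk.corpus import wordnet as wn  # requires the NLTK WordNet data to be downloaded

def relatedness(word1, word2):
    """Maximum Wu-Palmer similarity over the noun synsets of the two words."""
    scores = [s1.wup_similarity(s2)
              for s1 in wn.synsets(word1, pos=wn.NOUN)
              for s2 in wn.synsets(word2, pos=wn.NOUN)]
    return max((s for s in scores if s is not None), default=0.0)

locations = ["kitchen", "bathroom", "office", "garage"]
print(sorted(locations, key=lambda loc: relatedness("spoon", loc), reverse=True))
```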
Page 7 ‘filter out odd or uncommon examples of objects or locations like Ghodiyu or Fainting_room.’ – how do you decide which ones are odd or uncommon?
Same page, ‘We proceed analogously for the locations, selecting 20 common locations and thus obtain 2,000 object-location pairs in total.’ – wouldn’t the top-ranked locations be too general to be useful? Can you show examples?
Page 8, ‘131.067’ – I suppose you wanted to use ‘,’ instead of ‘.’ in all numbers.
Page 11: the threshold is derived based on only one relation; how generalizable is it?
Page 12: ‘we score the test sets for each dataset with our presented methods and order the produced lists by the generated scores…’ – what is the ‘test set’? You never explain this.

Review #2
By Dagmar Gromann submitted on 25/Apr/2017
Suggestion:
Minor Revision
Review Comment:

This article compares two distributional semantic approaches - embedding-based and vector-representation-based - for extracting and ranking object relations by their prototypicality, against two naive baselines based on co-occurrence frequency (link-frequency) and frequency of the relation target (location-frequency). The two unsupervised distributional approaches are in turn compared to a third, more refined supervised approach, in which a ranking function is learned by a neural network. Two semantic relations are used to exemplify the approach: locative and instrumental relations. The main novelty of the approach lies in treating relation extraction as a ranking problem and providing a neural network approach to dynamically learn this ranking function. Furthermore, the approach contributes a knowledge base of validated locative and instrumental object-relation-target triples. A linear combination of the two distributional semantic approaches in their unsupervised form outperforms the other approaches.

The paper is well-written and structured, and the choices regarding data and methods are excellently motivated and detailed. It is original in terms of its triangulation of methods and thus provides significant results. However, the claims raised regarding cognitive robotics and object manipulation tasks are not supported by any evidence. Prototypicality scores or ranks are assigned to already specified object-locatedAt-location triples rather than extracted from text, which might be part of the reason why the unsupervised methods outperform the supervised method in most evaluations. In spite of the title, the focus of the approach is on ranking object-relation-location triples by prototypicality rather than on extracting semantic relations from text corpora. This somewhat limits the originality of the approach as well as the significance of the results.

Overall comments:
- It seems to me that the justification and motivation for using DBpedia as the base resource against which all others are linked/restricted is missing, given its "lack of knowledge about common objects" (p. 4). Why not use ConceptNet or the SUN database as the main data source and generate an LOD resource?
- the references are outdated and incomplete (from a content perspective as well as regarding the provided reference entries, see below)
- I would suggest changing the title from text corpora to semi-structured data

Comments by Section:
Abstract:
The superiority of the neural network approach over the unsupervised versions is not clearly or strongly supported by the data. In fact, it rather seems that the linear combination of the two unsupervised approaches clearly outperforms the supervised neural network approach.

Introduction (1):
I am not entirely sure about the meaning of "perceive the world appropriately" - what is appropriate here? The provided view on cognitive robotics is somewhat simplistic and quite outdated - acquiring knowledge by self-experience is a vital part of constructive memory models that equally rely on previous conceptualizations (e.g. situatedness based on ideas from Barsalou, e.g. [1]). Other works have shown that learning spatial relations for object manipulation tasks benefits from dynamic learning processes as opposed to static templates (see e.g. [2]). Generalization to new environments and to variations in natural language instructions requires a more dynamic approach than a static knowledge base. Even though mapping natural language instructions to static knowledge has been done before (e.g. [3]), it is not scalable.

As regards the extraction of locative and instrumental relations from text, there is a whole field of research in cognitive linguistics on spatial relations (see e.g. [6]), their extraction from text (e.g. [5]), and their implications for object manipulation tasks (see e.g. [4]); the same goes for intended use, which should be considered here.

Related Work (2):
With the claim of providing a knowledge base for object manipulation by embodied agents, the related work section should include a section on contiguity relations in representing physical environments (see e.g. [2] for some pointers) as well as on extracting contiguity and instrumental relations from text (see above, and maybe also the CogALex task 2016). While little can be gained from computational linguistics regarding spatial and instrumental relations, cognitive linguistics has been highly active in this regard (see above). I would also suggest at least mentioning the large number of lexical semantic relation databases (e.g. [7] and [8]) that have been produced over the last decades. I think the claim that the resulting knowledge base contains substantial object knowledge is somewhat bold. This is implied in this section, while the lack of sensori-motor grounding is clearly acknowledged.

Method (3):
While I think the cosine similarity is a rather simplistic distance measure that could potentially be improved upon (see e.g. [9]), I think the scoring function is an excellent choice. I have one question regarding the negative triples: how could you ensure that the randomly generated negative triples did not accidentally contain positive examples (see the sketch below for one possible safeguard)? I find the concept of a "correct" triple over an "incorrect" one somewhat difficult with this dataset - a wallet might well be in the bathroom; however, its prototypical location might be different.
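One simple safeguard would be to rejection-sample the negatives against the set of annotated positive pairs; the sketch below (with hypothetical names and toy data, not the authors' actual procedure) shows the idea. Note that this only rules out collisions with annotated positives; pairs that are plausible but simply unannotated would still slip through, which is the deeper concern.

```python
import random

def sample_negatives(objects, locations, positive_pairs, n, seed=0):
    """Sample n (object, location) pairs that do not occur among the annotated positives."""
    rng = random.Random(seed)
    positives = set(positive_pairs)
    negatives = set()
    while len(negatives) < n:
        pair = (rng.choice(objects), rng.choice(locations))
        if pair not in positives:
            negatives.add(pair)
    return sorted(negatives)

objects = ["spoon", "wallet", "towel"]
locations = ["kitchen", "bathroom", "office"]
positives = [("spoon", "kitchen"), ("towel", "bathroom")]
print(sample_negatives(objects, locations, positives, n=4))
```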

Dataset (4):
Locations are restricted to entities subcategorized as "Rooms". Does this restriction to "rooms" not create a difficulty in rendering the crowdsourced database comparable to the set extracted from the SUN database, which uses a broader definition of locations?

Crowdsourcing:
- for how many of the 2,000 object-location pairs were pictures available for the task?
- how many participants provided the judgements?
- what is the average number of judgements per pair (given the extremes of 5 minimum and 100 maximum)?
- how many test questions did you have and what was the minimum trust level required on them?

SUN dataset:
- how many of the pairs were left after mapping to DBpedia?
- how big is the overlap between the SUN pairs and the crowdsourced/ranked pairs?

Evaluation (5):
Table 7: While Max represents the maximum of the respective column for the last two columns, the first has .913 as a maximum. Am I misunderstanding something here?

Building a knowledge base (6):
Is the resource publicly available?

References:
The references need to be corrected, for instance,
- [1] is missing the names of the authors,
- capitalization seems to be random,
- the pages are missing for [3],
- [11] is missing the title,
- [12] is a GitHub link instead of the paper reference,
- [18] states pages 2-2
- year in [21] is missing
- among many more.
The whole list of references needs to be thoroughly corrected.

Minor comments (in order of appearance):
p.1 (Abstract line 2): "extract of common sense knowledge" => "extract common sense knowledge"
p.2 ", e.g." => once with comma after "e.g.," once without "e.g." - please use one version consistently
p.2 "in which a just the cosine similarity" => "in which just the cosine similarity"
p.2 "teamOf [6] Instead," => there is a period missing between these two sentences
p.3 "the knowledge extract by our approach" => "extracted"
p.3 "senso-motorically grounded" => "sensor-motorically grounded"
p.3 "to train relation extraction approach" => to train their/a relation extraction approach"
p.3 "were tested in the past years in computational linguistics" => "were tested in computational linguistics in the past years"
p.4 "DBpedia, i.e. a lack" => "DBpedia, i.e., a lack" (comma missing several times)
p.4. "gold standard (see Section 5)" => shouldn't Section 4 on datasets be referenced here?
p.5 should "locatedAt" not be encoded as verbatim like in the other instances?
p. 7 "standing in the locatedAt with human judgements" => "standing"? relation missing? do word-pairs "stand" in a relation?
p. 7 "Crowdflower" => "CrowdFlower" (and later)
p.9 "other kind of relations" => "other kinds fo relations"
p. 11 "identiy matrix" => "identity matrix"

References used in the review above:
[1] Barsalou, Lawrence W. "Simulation, situated conceptualization, and prediction." Philosophical Transactions of the Royal Society of London B: Biological Sciences 364.1521 (2009): 1281-1289.
[2] Misra, D. K., Sung, J., Lee, K., & Saxena, A. (2016). Tell me Dave: Context-sensitive grounding of natural language to manipulation instructions. The International Journal of Robotics Research, 35(1-3), 281-300.
[3] Bollini, M., Tellex, S., Thompson, T., Roy, N., & Rus, D. (2012). Interpreting and executing recipes with a cooking robot. In: ISER.
[4] Spranger, M. (2016). The evolution of grounded spatial language.
[5] Kordjamshidi, P., Van Otterlo, M., & Moens, M. F. (2011). Spatial role labeling: Towards extraction of spatial relations from natural language. ACM Transactions on Speech and Language Processing (TSLP), 8(3), 4.
[6] Talmy, L. (2005). The fundamental system of spatial schemas in language. From perception to meaning: Image schemas in cognitive linguistics, 3.
[7] Santus, E., Yung, F., Lenci, A., & Huang, C. R. (2015, July). EVALution 1.0: an evolving semantic dataset for training and evaluation of distributional semantic models. In Proceedings of the 4th Workshop on Linked Data in Linguistics (LDL-2015) (pp. 64-69).
[8] Hill, F., Reichart, R., & Korhonen, A. (2016). SimLex-999: Evaluating semantic models with (genuine) similarity estimation. Computational Linguistics.
[9] Santus, E., Chersoni, E., Lenci, A., Huang, C. R., & Blache, P. (2016). Testing APSyn against vector cosine on similarity estimation. arXiv preprint arXiv:1608.07738.

Review #3
By Jedrzej Potoniec submitted on 26/Jun/2017
Suggestion:
Major Revision
Review Comment:

Provided below is a short summary along the main reviewing dimensions; detailed remarks follow.
* originality: The paper is an extension of a conference paper from EKAW 2016. While the authors clearly present what the new contribution of the paper is, I believe some parts of the text could be better marked as originating from the EKAW paper.
* significance of the results: While the presented idea is interesting, the obtained experimental results are inconclusive. I also find that some of the assumptions and decisions made by the authors are not satisfactorily justified.
* quality of writing: the manuscript is quite hard to follow; some things become clear only when reading the manuscript a second time.

In my opinion, the presented manuscript has a few drawbacks that must be corrected before the paper can be accepted to a journal:
* The last paragraph of Section 1 states that the unsupervised models presented in Section 3 were already presented in the EKAW paper. This is fine with me, as the authors clearly explain that the third method is the new contribution. My problem is with Sections 5.1.1, 5.1.2 and 6, which do not take the new contribution into account and seem to be a paraphrase of Sections 5.2-5.4 and 6 of the EKAW paper. In particular, Section 5.1.2 seems to be word-for-word identical with Section 5.4 of the EKAW paper. In my opinion, it should be clearly stated in the manuscript that these experiments were already described and/or they should be extended (more on this below).
* The authors formulate the relation extraction problem as a ranking problem. This is very interesting, but the manuscript does not explain what the benefits of this approach are. Moreover, it seems to me that the authors have a hard time escaping from the classification approach: the learning objective (Equation 2, Section 3.3) contains crisp labels of 1 and -1, which basically means that a perfect model would not generate a ranking at all, but just assign triples to two classes. In my opinion this resembles classification with probability estimation more than learning to rank (a minimal sketch of what a genuine pairwise ranking objective could look like is given after this list). This is also visible during the evaluation in Sections 5 and 6, where the authors introduce various thresholds that serve as hyperparameters for classification and allow for controlling the precision-recall trade-off.
* In the crowdsourcing approach there are numeric labels (2, 1, -1, -2) assigned to the possible decisions (usual, plausible etc.), which by itself is fine. The problem is that such labels should not be averaged, as this causes strange trade-offs, for example: is it really true that two contributors saying “plausible” negate a single contributor saying “unexpected”? In my opinion, quadruples of normalized counts of each decision should be used to build a partial order, which can then be used as an incomplete, desired ranking (a sketch of this comparison is given after this list). One possible method is as follows: assume that for the triple (spoon, locatedIn, kitchen) there were 10 contributors, out of which 7 said “usual” and 3 said “plausible”; we then obtain the quadruple (0.7, 0.3, 0, 0). For the triple (washing machine, locatedIn, kitchen) there were 20 contributors, 2 answering “usual”, 5 “plausible”, 10 “unusual”, 3 “unexpected”, yielding the quadruple (0.1, 0.25, 0.5, 0.15). As 0.7>0.1, 0.3>0.25, 0<0.5 and 0<0.15, we conclude that the first triple should be ranked higher than the second triple. Now assume that a third triple received the counts, respectively, 5, 0, 0, 5; the corresponding normalized quadruple is (0.5, 0, 0, 0.5). While it clearly scored worse than the first triple, in my opinion it is incomparable with the second triple.
* The gathered datasets and the Keras model should be published.
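To illustrate the second point above: a genuine pairwise ranking objective only constrains the relative order of the scores rather than pushing them towards crisp labels. A minimal NumPy sketch of such a margin-based loss (illustrative only; this is not Equation 2 from the manuscript, and the values are made up) could look like this:

```python
import numpy as np

def pairwise_margin_loss(pos_scores, neg_scores, margin=1.0):
    """Hinge loss that penalizes a positive triple whenever it is not scored
    at least `margin` higher than the paired negative triple."""
    return np.maximum(0.0, margin - (pos_scores - neg_scores)).mean()

pos_scores = np.array([0.9, 0.4])  # scores of triples annotated as correct
neg_scores = np.array([0.2, 0.5])  # scores of sampled negative triples
print(pairwise_margin_loss(pos_scores, neg_scores))  # ~0.7
```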
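And to make the partial-order proposal in the third point concrete, the dominance comparison could be implemented along the following lines (the numbers reproduce the example given above; this is a sketch of my suggestion, not of anything in the manuscript):

```python
from collections import Counter

LABELS = ("usual", "plausible", "unusual", "unexpected")

def quadruple(judgements):
    """Normalized counts of the four crowd decisions for one triple."""
    counts = Counter(judgements)
    return tuple(counts[label] / len(judgements) for label in LABELS)

def ranked_higher(q1, q2):
    """q1 dominates q2 if it has at least as much mass on the positive labels
    and at most as much on the negative ones, and the quadruples differ."""
    return (all(a >= b for a, b in zip(q1[:2], q2[:2]))
            and all(a <= b for a, b in zip(q1[2:], q2[2:]))
            and q1 != q2)

spoon  = quadruple(["usual"] * 7 + ["plausible"] * 3)                 # (0.7, 0.3, 0.0, 0.0)
washer = quadruple(["usual"] * 2 + ["plausible"] * 5
                   + ["unusual"] * 10 + ["unexpected"] * 3)           # (0.1, 0.25, 0.5, 0.15)
third  = quadruple(["usual"] * 5 + ["unexpected"] * 5)                # (0.5, 0.0, 0.0, 0.5)

print(ranked_higher(spoon, washer))  # True: the first triple outranks the second
print(ranked_higher(spoon, third))   # True
print(ranked_higher(third, washer) or ranked_higher(washer, third))  # False: incomparable
```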

Minor remarks:
* Abstract:
** “object(type)” - unless you already know the paper it is unclear what it means
** “ranking or scoring” - decide on a single term here and use it consistently throughout the paper
** The last sentence should be more precise and give some numerical values.
* Introduction:
** Page 2, left column: the paragraph “We present and compare (...) a certain object” is quite unclear and does not read well.
** Page 2, right column, “we propose three novel contributions”: there are four dashes, so something seems inconsistent here.
** Same place, last dash, sentence “Standard relations considered...”: the syntax of relation names is inconsistent (is-a vs. area_served vs. containedBy).
* Related work:
** In the machine reading paradigm I think http://www.semantic-web-journal.net/content/semantic-web-machine-reading... should be cited. I also think that NELL: Never-Ending Language Learning is a relevant project.
** There are numerous papers on DBpedia; at least one of them should be cited. The same goes for OpenCyc.
** There are some remarks that ConceptNet is not an LOD resource, but I do not really understand why this is relevant.
* Section 3:
** The first paragraph: what is a head/tail entity? From the rest of the paper and the venue I suspect that these triples are standard RDF triples, but why not then call their parts as usual: a subject, a predicate, an object? If this is not the case, then please explain in more detail what kind of triples these are and why such names are used.
** The next sentence in the same paragraph: what does “more likely correct” mean? It is a very vague term.
** The last sentence of the second paragraph: accepting/rejecting a triple is clearly a binary classification task.
* Section 3.1:
** The first paragraph: “corresponding vector representations V”. Why do you introduce the symbol V, which is not used anywhere in the vicinity?
** 4th paragraph, “DBpedia entity”: I find this somewhat vague. My guess is that by this you mean anything in the http://dbpedia.org/resource/ namespace and that the entity names occurring in the text are in truth local parts of IRIs. This should be clarified in the manuscript.
** The same paragraph: why did you not use rdfs:labels to compute vectors, but used the raw entity names instead?
** The same paragraph: were there any situations where there were words in the entities that were not present in the embeddings' vocabulary? If so, how did you deal with that?
* Section 3.2:
** Page 6, the last sentence above Table 2: in case of cosine similarity, 0 means orthogonal and -1 means opposite direction. I wonder to what extent this influences the ranking, as arguably -1 represents some relation between entities, while 0 represents no relation at all.
* Section 3.3:
** Did you need to use dropout? Were there any problems with a model without regularization? Why 0.1 and not the standard value of 0.5?
** The next paragraph: positive and negative triples suggest classification, not regression, so the description is quite strange. Also, at this point, I miss an explanation of what exactly these positive and negative triples are.
** The next paragraph: I do not understand why the gold standard does not provide negative triples. The crowdsourcing approach provides not only a positive/negative split, but also a ranking to train on.
* Section 4.1:
** Second paragraph: The description of which entities were selected is quite vague; I would appreciate seeing the corresponding SPARQL queries.
** The third paragraph: a very complicated approach. Why did you not use the Page Links dataset from DBpedia?
* Section 5 (as a whole): please make sure to clearly state which dataset is used for training and which for testing. For example, the last sentence in the introduction to Section 5 states that you “test its quality against the manually created gold standard dataset”, which confuses me: I do not know whether you mean yet another dataset, or whether this is a nickname for the dataset from the crowdsourcing approach described in Section 4.1.
* Section 5.1.1: There should be an equation for NDCG; a textual description is not enough (a standard formulation is given at the end of these remarks for reference).
* Section 5.1.2: I think the experiment should be extended by using a solver to find optimal values of alpha and the threshold. Especially in the case of alpha, it is unclear why such a coarse grid search should be enough.
* Section 5.1.3: Seeing that you initialize M_r to an identity matrix I am quite surprised that the neural network started to learn at all. I believe that the standard procedure is to initialize the weights randomly to avoid symmetry issues. Could you please comment on that?
* Section 5.2:
** Please select a single name for the set “plausible or usual”/”usual+plausible”/”usual” or “plausible” and stick with it. As of now, there are at least these three names scattered over the paper.
** I think equations for precision and recall could be useful here (standard definitions are given at the end of these remarks).
** The last sentence of the 4th paragraph: there is “and” missing before “496 pairs”.
** Figures 1 and 2 are barely readable in print and, as they are black and white in the PDF file, they are not much better on screen. Also, I think it would be useful to plot a precision-recall chart to directly show this trade-off.
** Figure 3: the legend obscures one of the lines.
* Section 6:
** In my opinion, there should be a discussion of a knowledge base generated with the supervised approach and/or with the combination of the presented approaches that yields the best results.
** The generated knowledge bases should also contain the scores for each triple. As it is now, I cannot decide whether the mistakes in the KB are due to a too-low threshold or whether there are some problems with the method.
* References: some of the references are incomplete; usually missing are page numbers, book series, and editors. Frequently, DBLP is a convenient source of high-quality BibTeX entries.
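For reference, the NDCG formulation requested in the remark on Section 5.1.1 above is commonly written as follows (one of the two standard DCG variants; this is a textbook definition, not taken from the manuscript):

```latex
\mathrm{DCG}@k = \sum_{i=1}^{k} \frac{2^{\mathrm{rel}_i} - 1}{\log_2(i + 1)},
\qquad
\mathrm{NDCG}@k = \frac{\mathrm{DCG}@k}{\mathrm{IDCG}@k},
```

where rel_i is the graded relevance of the item at rank i and IDCG@k is the DCG@k of the ideal (descending-relevance) ordering of the same items.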
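Likewise, the precision and recall equations suggested in the remark on Section 5.2 would simply be the standard definitions over the set of triples predicted as positive (again given only for reference):

```latex
\mathrm{Precision} = \frac{|\,\text{predicted} \cap \text{gold}\,|}{|\,\text{predicted}\,|},
\qquad
\mathrm{Recall} = \frac{|\,\text{predicted} \cap \text{gold}\,|}{|\,\text{gold}\,|}
```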