Review Comment:
Below is a short summary along the main reviewing dimensions; the detailed remarks follow.
* originality: The paper is an extension of a conference paper from EKAW 2016. While the authors clearly present what the new contribution of the paper is, I believe some parts of the text could be better marked as originating from the EKAW paper.
* significance of the results: While the presented idea is interesting, the obtained experimental results are inconclusive. I also find some of the assumptions and decisions made by the authors insufficiently justified.
* quality of writing: the manuscript is quite hard to follow; some things become clear only on a second reading.
In my opinion, the presented manuscript has a few drawbacks that must be corrected before the paper can be accepted for journal publication:
* The last paragraph of Section 1 states that the unsupervised models presented in Section 3 were already presented in the EKAW paper. This is fine with me, as the authors clearly explain that the third method is the new contribution. My problem is with Sections 5.1.1, 5.1.2 and 6, which do not take the new contribution into account and seem to be a paraphrase of Sections 5.2-5.4 and 6 of the EKAW paper. In particular, Section 5.1.2 seems to be word-for-word identical with Section 5.4 of the EKAW paper. In my opinion, it should be clearly stated in the manuscript that these experiments were already described, and/or they should be extended (more on this below).
* The authors formulate the relation extraction problem as a ranking problem. This is very interesting, but the manuscript does not explain what the benefits of this approach are. Moreover, it seems to me that the authors have a hard time escaping the classification approach: the learning objective (Equation 2, Section 3.3) contains crisp labels of 1 and -1, which basically means that a perfect model would not generate a ranking at all, but simply assign triples to two classes. In my opinion this resembles classification with probability estimation more than learning to rank. This is also visible in the evaluation in Sections 5 and 6, where the authors introduce various thresholds that serve as hyperparameters for classification and allow controlling the precision-recall trade-off.
* In the crowdsourcing approach there are numeric labels (2, 1, -1, -2) assigned to the possible decisions (usual, plausible, etc.), which by itself is fine. The problem is that such labels should not be averaged, as this causes strange trade-offs. For example: is it really true that two contributors saying “plausible” negate a single contributor saying “unexpected”? In my opinion, quadruples of normalized counts of each decision should be used to build a partial order, which can then serve as an incomplete, desired ranking (a minimal sketch of this comparison is given right after this list). One possible method is as follows: assume that for the triple (spoon, locatedIn, kitchen) there were 10 contributors, of whom 7 said “usual” and 3 said “plausible”; we then obtain the quadruple (0.7, 0.3, 0, 0). For the triple (washing machine, locatedIn, kitchen) there were 20 contributors, 2 answering “usual”, 5 “plausible”, 10 “unusual” and 3 “unexpected”, yielding the quadruple (0.1, 0.25, 0.5, 0.15). As 0.7>0.1, 0.3>0.25, 0<0.5 and 0<0.15, we conclude that the first triple should be ranked higher than the second. Now assume that a third triple received the counts, respectively, 5, 0, 0, 5, and the corresponding normalized quadruple is (0.5, 0, 0, 0.5). While it clearly scored worse than the first triple, in my opinion it is incomparable with the second triple.
* The gathered datasets and the Keras model should be published.
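To make the remark about the crowdsourced labels concrete, here is a minimal sketch of the dominance comparison I have in mind, using the normalized quadruples (usual, plausible, unusual, unexpected) from the examples above. This is only one possible formalization, not a prescription:

```python
def dominates(a, b):
    # a and b are quadruples of normalized counts (usual, plausible, unusual, unexpected).
    # a dominates b if it has at least as large a share of positive answers and at most
    # as large a share of negative answers in every component, and is strictly better in one.
    at_least_as_good = a[0] >= b[0] and a[1] >= b[1] and a[2] <= b[2] and a[3] <= b[3]
    strictly_better = a[0] > b[0] or a[1] > b[1] or a[2] < b[2] or a[3] < b[3]
    return at_least_as_good and strictly_better

spoon  = (0.7, 0.30, 0.0, 0.00)  # 7 "usual", 3 "plausible" out of 10 contributors
washer = (0.1, 0.25, 0.5, 0.15)  # 2/5/10/3 out of 20 contributors
third  = (0.5, 0.00, 0.0, 0.50)  # 5 "usual", 5 "unexpected" out of 10 contributors

print(dominates(spoon, washer))                             # True: ranked above the washing machine
print(dominates(spoon, third))                              # True: ranked above the third triple
print(dominates(third, washer), dominates(washer, third))   # False False: incomparable
```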
Minor remarks:
* Abstract:
** “object(type)” - unless you already know the paper, it is unclear what this means.
** “ranking or scoring” - decide on a single term here and use it consistently throughout the paper
** The last sentence should be more precise and give some numerical values.
* Introduction:
** Page 2, left column, the paragraph “We present and compare(...)a certain object” is quite unclear and does not read well.
** Page 2, right column, “we propose three novel contributions”: there are four dashes, so something seems inconsistent here.
** Same place, last dash, sentence “Standard relations considered...”: the syntax of relation names is inconsistent (is-a vs. area_served vs. containedBy).
* Related work:
** In the machine reading paradigm I think http://www.semantic-web-journal.net/content/semantic-web-machine-reading... should be cited. I also think that NELL: Never-Ending Language Learning is a relevant project.
** There are numerous papers on DBpedia; at least one of them should be cited. The same goes for OpenCyc.
** There are some remarks that ConceptNet is not a LOD resource, but I do not really understand why this is relevant.
* Section 3:
** The first paragraph: what is a head/tail entity? From the rest of the paper and the venue I suspect that these triples are standard RDF triples, but why not then call their parts, as usual, a subject, a predicate and an object? If this is not the case, please explain in more detail what kind of triples these are and why such names are used.
** The next sentence in the same paragraph: what does “more likely correct” mean? It is a very vague term.
** The last sentence of the second paragraph: accepting/rejecting a triple is clearly a binary classification task.
* Section 3.1:
** The first paragraph: “corresponding vector representations V”. Why do you introduce the symbol V, which is not used anywhere in the vicinity?
** 4th paragraph, “DBpedia entity”: I find this somewhat vague. My guess is that by this you mean anything in the http://dbpedia.org/resource/ namespace and that the entity names occurring in the text are in truth local parts of IRIs. This should be clarified in the manuscript.
** The same paragraph: why did you not use rdfs:labels to compute vectors, but used the raw entity names instead?
** The same paragraph: were there any situations where there were words in the entities that were not present in the embeddings' vocabulary? If so, how did you deal with that?
* Section 3.2:
** Page 6, the last sentence above Table 2: in the case of cosine similarity, 0 means orthogonal and -1 means opposite direction. I wonder to what extent this influences the ranking, as arguably -1 represents some relation between entities, while 0 represents no relation at all.
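A trivial illustration of why I think the distinction matters (numpy; the vectors are of course made up):

```python
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

a = np.array([1.0, 0.0])
print(cosine(a, np.array([0.0, 1.0])))   #  0.0 -> orthogonal, i.e. "no relation at all"
print(cosine(a, np.array([-1.0, 0.0])))  # -1.0 -> opposite direction, arguably still a relation
```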
* Section 3.3:
** Did you need to use dropout? Were there any problems with a model without regularization? Why 0.1 and not the standard value of 0.5?
** The next paragraph: positive and negative triples suggest classification, not regression, so the description is quite strange. Also, at this point, I miss an explanation of what exactly these positive and negative triples are.
** The next paragraph: I do not understand why the gold standard does not provide negative triples. The crowdsourcing approach provides not only a positive/negative split, but also a ranking to train on.
* Section 4.1:
** Second paragraph: The description of which entities were selected is quite vague; I would appreciate seeing the corresponding SPARQL queries.
** The third paragraph: a very complicated approach. Why did you not use the Page Links dataset from DBpedia?
* Section 5 (as a whole): please make sure to clearly state which dataset is used for training and which for testing. For example, the last sentence of the introduction to Section 5 states that you “test its quality against the manually created gold standard dataset”, and this confuses me: I do not know whether you mean yet another dataset or whether this is another name for the dataset from the crowdsourcing approach described in Section 4.1.
* Section 5.1.1: There should be an equation for NDCG, a textual description is not enough.
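For reference, the formulation I would expect to see (one common variant; rel_i is the graded relevance of the item at rank i and IDCG@k is the DCG of the ideal ordering; the manuscript should also state whether linear or exponential gain is used):

```latex
\mathrm{DCG@}k = \sum_{i=1}^{k} \frac{2^{rel_i} - 1}{\log_2(i + 1)},
\qquad
\mathrm{NDCG@}k = \frac{\mathrm{DCG@}k}{\mathrm{IDCG@}k}
```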
* Section 5.1.2: I think the experiment should be extended by using a solver to find optimal values of alpha and the threshold (a sketch of what I mean follows below). Especially in the case of alpha it is unclear why such a coarse grid search should be enough.
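A sketch of what I mean; the objective below is only a toy stand-in, in the real experiment it would be the NDCG (or precision) of the alpha-weighted score combination on a validation split:

```python
from scipy.optimize import minimize_scalar

# Toy stand-in for "quality of the combined ranking as a function of alpha";
# the quadratic shape is purely illustrative.
def validation_quality(alpha):
    return 1.0 - (alpha - 0.37) ** 2

# Bounded scalar optimization over alpha in [0, 1] instead of a coarse manual grid.
result = minimize_scalar(lambda a: -validation_quality(a), bounds=(0.0, 1.0), method="bounded")
print(result.x)  # ~0.37 for the toy objective
```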
* Section 5.1.3: Seeing that you initialize M_r to an identity matrix, I am quite surprised that the neural network started to learn at all. I believe that the standard procedure is to initialize the weights randomly to avoid symmetry issues. Could you please comment on that?
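For clarity, what I would consider the default (a minimal numpy sketch; the dimensionality is only an example):

```python
import numpy as np

d = 100  # embedding dimensionality, illustrative only
rng = np.random.default_rng(42)

M_r_identity = np.eye(d)                      # the initialization reported in the manuscript
M_r_random   = rng.normal(0.0, 0.05, (d, d))  # small random values, the usual way to avoid symmetry issues
```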
* Section 5.2:
** Please select a single name for the set “plausible or usual”/”usual+plausible”/”usual” or “plausible” and stick with it; as of now, at least these three names are scattered throughout the paper.
** I think equations for precision and recall could be useful here; for reference, the definitions I have in mind are given right after this group of remarks.
** The last sentence of the 4th paragraph: an “and” is missing before “496 pairs”.
** Figures 1 and 2 are barely readable in print and, as they are black and white in the PDF file, they are not much better on the screen. Also, I think it would be useful to plot a precision-recall curve to show this trade-off directly.
** Figure 3: the legend obscures one of the lines.
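For reference, the definitions I have in mind for the positive (“usual or plausible”) class, with TP, FP and FN counted against the crowdsourced labels:

```latex
\mathrm{Precision} = \frac{TP}{TP + FP},
\qquad
\mathrm{Recall} = \frac{TP}{TP + FN}
```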
* Section 6:
** In my opinion, there should be a discussion of a knowledge base generated with the supervised approach and/or with the combination of the presented approaches that yields the best results.
** The generated knowledge bases should also contain the score for each triple. As it is now, I cannot decide whether the mistakes in the KB are due to too low a threshold or to problems with the method itself.
* References: some of the references are incomplete; page numbers, book series, and editors are often missing. DBLP is frequently a convenient source of high-quality BibTeX entries.