Review Comment:
Note on the suggested decision: a more suitable category for this paper would be "revise and resubmit"; at the same time, I find "reject" too strict.
The paper presents interesting work on using a newly annotated resource, the TED Multilingual Discourse Bank (TED-MDB), to induce bilingual lexicons of discourse connectives. However, I think the paper as it stands has a series of flaws/shortcomings that must be addressed:
1) The writing and the narrative must be more focused. I think the authors fail to present all the work behind this paper. In summary, they are introducing a new language resource, presenting two methods to automatically align discourse relations between pairs of languages, and using the automatically aligned data to induce bilingual lexicons of discourse connectives. This is a lot of material, and the presentation and rationale of the paper are not clear enough.
2) The TED Multilingual Discourse Bank (TED-MDB): this is a new language resource, unique in its annotation layer. More information should be given to properly present it: who annotated it? How many annotators were there? Is there an agreement study? How is the hierarchy of relations (PDTB-style) structured? How many classes are there, and what do they mean? Why is the TED-MDB data format difficult? Were any language-specific adaptations of the guidelines needed? If a discourse relation holds between two non-adjacent arguments, is it annotated or not?
The first six examples should help readers follow the different types of relations. I think it would be more useful to present the different classes and provide the examples immediately after. All examples are in English except example 2; why? I find this quite confusing here.
You also fail to properly explain/introduce all relevant annotation layers. It is stated only at lines 40-42, pg 8, that there are gold annotations of aligned discourse relations. This is confusing for the reader. It also raises the question of why no supervised methods have been attempted (not even a justification for not doing so is given).
3) Dictionary of discourse connectives: this seems to be the core contribution, but the motivations for creating such resources are lacking, or at best left implicit in the paper. Why do we need such resources? Why are they useful?
The authors claim that there has been an "upsurge" of discourse connective lexicons, and then present only four. I think this could be expanded with more pointers and by presenting these lexicons in more detail (how were they created?).
4) Alignment methods: two methods for aligning the annotations across languages are presented. First, the presentation of the two methods needs to be rewritten: the authors could shorten Section 3 and be more precise and concise about what they did. It would also be interesting to know why these two methods were selected rather than performing a manual alignment for all languages.
Method I: the main idea is to apply word alignment to identify the same portions of text and then project the discourse relations from English to any other aligned language. Why use the NLTK sentence tokeniser? By default the module works for English; for other languages you either need a different sentence tokeniser or it makes no sense, unless you used a language-specific model, in which case you should make that clear. The presentation of the scoring method needs to be restructured: first state that you look for all alignments based on a connective, then present the "overlap" approach.
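To make the projection step concrete for the authors: the core of Method I, as I understand it, can be sketched in a few lines. The data structures below (an alignment as a set of source-target index pairs, spans as token-index sets) are my illustrative assumptions, not the paper's actual format.

```python
# Minimal sketch of annotation projection via word alignment.
# `alignment` maps English token indices to target-language token indices,
# as a word aligner would produce. Data structures are illustrative
# assumptions, not taken from the paper.

def project_span(span, alignment):
    """Project a source-side token span to the target side via the alignment."""
    targets = sorted(t for s, t in alignment if s in span)
    if not targets:
        return None  # nothing on the source side was aligned
    # Take the contiguous cover of the aligned target tokens.
    return set(range(targets[0], targets[-1] + 1))

# English relation: connective at token 3, Arg2 spanning tokens 4-6.
alignment = {(0, 0), (1, 1), (2, 2), (3, 3), (4, 5), (5, 4), (6, 6)}
connective_span = {3}
arg2_span = {4, 5, 6}

print(project_span(connective_span, alignment))  # {3}
print(project_span(arg2_span, alignment))        # {4, 5, 6}
```

Stating the method at roughly this level of abstraction would make Section 3 both shorter and easier to follow.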
Method II: my first question is: why a 0.6 cosine similarity threshold? On what basis was the threshold established? The manually aligned data? If so, this raises quite serious doubts about the validity of your evaluation: it is as if you had "trained" and tested on the same data, so it is no surprise that the results are quite good. It is also not easy to follow how the scoring is computed: you speak of the similarity of discourse relations, but shouldn't it be the similarity of the arguments of a discourse relation?
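To make the threshold concern concrete: a cosine-similarity match over sentence embeddings (as Method II appears to do with LASER) typically reduces to the sketch below. The point is that the 0.6 value is a free hyperparameter, so it must be set on data disjoint from the evaluation set; the vectors here are illustrative stand-ins, not real LASER embeddings.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity of two embedding vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def is_match(emb_src, emb_tgt, threshold=0.6):
    # 0.6 mirrors the paper's choice; as a free hyperparameter it should
    # be tuned on held-out data, not on the manually aligned test data.
    return cosine(emb_src, emb_tgt) >= threshold

# Illustrative vectors standing in for LASER argument embeddings.
u = np.array([1.0, 0.0, 1.0])
v = np.array([1.0, 0.2, 0.9])
print(is_match(u, v))  # True (similarity is close to 1)
```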
When presenting the score between matched discourse relations (lines 23-33, pg 7): why do you first state a 1-or-0 score and then present the other scores according to the different levels of the relation taxonomy?
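One transparent way to present such a scheme would be as partial credit over the levels of a PDTB-style sense hierarchy. The sketch below is my reading of what a level-wise score could look like, not necessarily the authors' exact formula; the sense strings are illustrative.

```python
def sense_match_score(sense_a, sense_b):
    """Partial-credit match over a PDTB-style sense hierarchy.

    Senses are dot-separated paths, e.g. "Contingency.Cause.Reason".
    Score = number of matching leading levels / depth of the deeper sense.
    (An assumed illustrative scheme, not taken from the paper.)
    """
    a, b = sense_a.split("."), sense_b.split(".")
    depth = max(len(a), len(b))
    matched = 0
    for x, y in zip(a, b):
        if x != y:
            break
        matched += 1
    return matched / depth

print(sense_match_score("Contingency.Cause.Reason",
                        "Contingency.Cause.Reason"))   # 1.0 (exact match)
print(sense_match_score("Contingency.Cause.Reason",
                        "Contingency.Cause.Result"))   # 2/3 (two levels agree)
print(sense_match_score("Contingency.Cause",
                        "Comparison.Contrast"))        # 0.0 (top level differs)
```

Presenting the scoring with an explicit formula of this kind would remove the ambiguity between the 1-or-0 description and the level-wise description.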
How is the semantic similarity between the connectives obtained? Do you apply LASER to individual tokens?
5) Evaluation: I would add references in support of the evaluation method you have adopted; this will better support your claim that in the literature linking quality is evaluated using Precision, Recall and F1-score (these are very common metrics). The section can be shortened and made more precise. Tables 6 and 7 should be merged: you are comparing the results, so you must allow readers to follow the comparison. Bold should be applied per language pair and per method, marking the method that gives the best result. After presenting the results, I would add a discussion of the errors and problems of each method.
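For completeness, the standard set-based definitions being referred to, applied to alignment links (the gold and predicted links below are invented for illustration):

```python
def prf(true_links, pred_links):
    """Precision, recall and F1 over sets of predicted alignment links."""
    tp = len(true_links & pred_links)
    p = tp / len(pred_links) if pred_links else 0.0
    r = tp / len(true_links) if true_links else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

# Hypothetical gold and predicted relation alignments (EN-TR).
gold = {("en_rel1", "tr_rel1"), ("en_rel2", "tr_rel3")}
pred = {("en_rel1", "tr_rel1"), ("en_rel2", "tr_rel2")}
print(prf(gold, pred))  # (0.5, 0.5, 0.5)
```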
6) I find it difficult to follow the rationale behind the "retranslated" examples; they are not very useful. You can use LaTeX packages that align translation examples word by word (e.g. gb4e) to make it easier to follow the potential gaps in the translations.
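For instance, gb4e produces interlinear glossed examples like the following (the Turkish sentence and glosses here are my own illustration, not taken from the paper):

```latex
% Preamble: \usepackage{gb4e}
\begin{exe}
  \ex \gll Çok yorgundu \textbf{ama} gitti. \\
           very tired-was \textbf{but} went \\
      \glt `He was very tired, \textbf{but} he left.'
\end{exe}
```

With the connective highlighted in both the original and the gloss, readers can see at a glance where a translation drops or changes a connective.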
7) Section 4: the rationale of this section is somewhat lacking in my opinion. For some languages you could have used the gold data to conduct the analysis and reflect on the differences in the realisations of connectives and in the types of discourse relations. I have some difficulty following all the pairs in Figures 1 and 2: for some language pairs you could report the gold data, while for others the data are automatically aligned with no evaluation (no gold available). It seems to me that it would make more sense to split this level of analysis: first what happens at the gold level, and then what happens for the other languages for which no gold alignment is available. Are the tendencies the same or do they differ? This may provide additional insights in the evaluation of your approach.
8) The lexicon: this is one of the major products of this work, but it is presented in a rushed way. I would have loved to see an example of a lexicon entry so as to better understand its structure, especially the connection to the connective-lex.info web page.
I was not able to find any resource file accompanying the paper, nor a link to a publicly available repository.
Other comments:
- lines 43-44, pg 7: please set the translations of the Turkish connectives in a different font from the original; otherwise it seems that Turkish has six connectives.
- lines 26-36, pg 6: the paragraph contains two sentences. Swap their order; this makes the flow of information more coherent.
- line 49, pg 6: what is a bi-text? Define it!
- line 39, pg 4: RST --> spell out the acronym first: Rhetorical Structure Theory (RST), and add a reference.
- line 4 pg 7: the score is scored --> the score is calculated/obtained