Review Comment:
This is a resubmission. Therefore I will focus only on the issues I raised in my previous reviews:
(1) Some aspects of the experimental protocol are unclear: which significance test was used for evaluation, and was cross-validation used across all experiments?
(2) Missing related work, in particular on existing deep learning approaches for the task at hand.
(3) Justification of the deep architecture used.
(4) Reducing the number of times the word “deep” is used.
The authors have added a general reference to deep learning and have included several additional references (from originally 20 up to 39). In particular, they have related the present paper to several of the references provided in my first review. Thanks! So, (2) has been nicely addressed. The authors now also justify the CNN architecture used by describing it in more detail and providing a reference. As the present paper is less about the particular architecture, this is more than fine. So, (3) has been addressed. Thanks! Also, the authors switch from “deep” to “neural” on several occasions, which I agree is the better alternative. Thanks! So, (4) has been addressed.
It is not clear to me, however, why a McNemar test was used and not some t-test. The McNemar test is for nominal values (as far as I know), and given that we are interested here in accuracy etc., a t-test would have been more natural, in my opinion. I guess the authors considered the win-loss ratio, which is fine. Overall, I would accept the test, but an explanation would be nice here. I leave this (whether it is important, and if so, checking the revision) to the editor.
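For context on the nominal-data point: the McNemar test compares two classifiers on the paired per-instance correct/incorrect outcomes, using only the discordant pairs. A minimal self-contained sketch (the function and its inputs are my own illustration, not code from the paper under review):

```python
from math import comb

def mcnemar_exact(b, c):
    """Exact (binomial) McNemar test on the two discordant cell counts:
    b = #instances classifier A got right and classifier B got wrong,
    c = #instances A got wrong and B got right.
    Under H0 (equal error rates) the discordant outcomes are Bin(b+c, 0.5)."""
    n = b + c
    k = min(b, c)
    # two-sided exact binomial p-value: 2 * P(X <= min(b, c)), capped at 1
    p = sum(comb(n, i) for i in range(k + 1)) / 2 ** (n - 1)
    return min(1.0, p)
```

For example, with 15 discordant wins for one classifier and 5 for the other, `mcnemar_exact(15, 5)` is about 0.041, i.e. significant at the usual 0.05 level, even though the test never looks at accuracies as continuous values.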
The only remaining downside, as far as I can see, is the still unclear protocol for the first experiment, the evaluation of the CRF features. As the authors do not touch upon the cross-validation setting, I read this part again. I noticed that the neural embeddings are learned on the complete dataset. This is unfair, as the embCRF now has more knowledge of the dataset than the standard one. In turn, the results in Table 3 have to be better justified: is it because of using the full dataset for embCRF? This has to be clarified before publication. As the same features have been used in all other experiments, the other experiments should be checked, too. Or, as asked for in my previous review (sorry if this was not clear), the authors should justify the experimental setup. Generally, a significance test should be run everywhere, but I leave this decision to the editor. So, contrary to what the authors argue, not all classifiers have been trained on the same data (at least potentially).
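The leakage-free protocol being asked for can be sketched as follows: the embedding model is retrained inside each fold on the training split only, so the embCRF features carry no knowledge of the test fold. All function names (`train_embeddings`, `train_crf`, `evaluate`) are hypothetical stand-ins, not the authors' code:

```python
def fold_indices(n_docs, n_folds=5):
    """Round-robin fold assignment; any standard CV splitter would do."""
    return [list(range(i, n_docs, n_folds)) for i in range(n_folds)]

def cross_validate(docs, gold, train_embeddings, train_crf, evaluate, n_folds=5):
    scores = []
    for test_idx in fold_indices(len(docs), n_folds):
        test_set = set(test_idx)
        train_idx = [i for i in range(len(docs)) if i not in test_set]
        # Key point: embeddings are learned on the training split only,
        # never on the held-out fold the CRF is later evaluated on.
        emb = train_embeddings([docs[i] for i in train_idx])
        model = train_crf([docs[i] for i in train_idx],
                          [gold[i] for i in train_idx], emb)
        scores.append(evaluate(model, [docs[i] for i in test_idx],
                               [gold[i] for i in test_idx]))
    return scores
```

The point of the sketch is only the split discipline: every document appears in exactly one test fold, and neither the embeddings nor the CRF ever see it during training for that fold.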
To summarise, (2)-(4) have been addressed well. Thanks! (1) has been partly addressed and has raised a new, more refined issue.