Review Comment:
The authors propose automatic methods to refine and extend sentiment lexica
based on contextual information and on conceptual and semantic association
data. While there is a considerable body of work on using contextual
information to extend sentiment lexica, the application of semantic resources
is often limited to restricted synonymy datasets (WordNet). This article
addresses the precise research gap of evaluating how conceptual and semantic
data from DBpedia can be used to improve lexica for sentiment analysis, in the
context of applying SentiStrength to estimate the polarity of tweets.
The article explains the methods applied to extend the lexica in great detail,
but the statistical methods used to evaluate the improvement are insufficient
to conclude that it is significant, in particular since it is relatively
small. The description of the statistical tests is essentially summarized in
the sentence "Statistical significance is done using Wilcoxon signed-rank
test", and little to no detail is given about which distributions are
compared, what the null hypotheses are, or what the estimated difference
between medians is. To conclude that there is a statistically significant
improvement, the authors need to show a test that compares the Precision,
Recall, and F1 distributions resulting from the original method versus their
improvement. Furthermore, since the design includes three models
(context-based, conceptually-enriched, and semantically-adjusted relations)
and three applications (LU, LE, LUE), for each dataset the authors are testing
nine closely related hypotheses, and thus the p-values of the statistical
tests need to be corrected for multiple hypothesis testing.
Since the article focuses on the extension with DBpedia data, the same
statistical tests should also be performed between the three models proposed
by the authors, to assess whether that data does or does not increase
performance. Unless the performance of the models is compared pairwise, it is
impossible to conclude whether the second and third models really constitute
an improvement. A sketch of the kind of analysis we would expect follows.
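To make the request concrete, here is a minimal sketch in Python of the
analysis we have in mind. All variable names and the dummy data are
hypothetical stand-ins for the authors' paired per-fold (or per-split) scores;
the point is the pairing, the per-configuration Wilcoxon tests, and the
correction over the nine tests:

    import numpy as np
    from scipy.stats import wilcoxon
    from statsmodels.stats.multitest import multipletests

    rng = np.random.default_rng(0)
    apps = ("LU", "LE", "LUE")
    models = ("context", "concept", "semantic")

    # Dummy paired F1 scores standing in for the authors' actual results:
    # baseline_f1[app] for the original lexicon, f1_scores[model][app] for
    # each proposed model, paired by evaluation fold.
    baseline_f1 = {app: rng.uniform(0.5, 0.7, size=20) for app in apps}
    f1_scores = {m: {app: baseline_f1[app] + rng.normal(0.01, 0.02, size=20)
                     for app in apps}
                 for m in models}

    # One signed-rank test per model x application configuration.
    p_values, labels = [], []
    for model in models:
        for app in apps:
            _, p = wilcoxon(f1_scores[model][app], baseline_f1[app])
            p_values.append(p)
            labels.append(f"{model}/{app}")

    # Correct the nine closely related tests; Holm's method is a reasonable,
    # less conservative default than Bonferroni.
    reject, p_corr, _, _ = multipletests(p_values, alpha=0.05, method="holm")
    for label, p, r in zip(labels, p_corr, reject):
        print(f"{label}: corrected p = {p:.4f}, reject H0 = {r}")

The same wilcoxon() call, applied pairwise between the three models (with the
pairwise p-values included in the correction), would support the
model-versus-model comparison requested above.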
Besides this major concern, I have other smaller comments:
- The article heavily builds on the lexicon of SentiStrength, but argues only
briefly for this choice, without further support ("to the best of our
knowledge, it is currently one of the best performing lexicons"). The authors
need to refer to some benchmark study showing why the chosen method can be
considered state of the art.
- While it is explained in good detail, there is little argumentation (end of
p. 4) about why the authors need the machinery of the SentiCircle approach.
What do they obtain by computing polar coordinates that could not be obtained
by simple averaging with TF-IDF? Why is it desirable to have a dimensional
representation in which extremely negative and extremely positive terms are
very close to each other (the area of large r and theta close to 180 degrees;
see the toy computation after this list)? How is the method framed with
respect to simpler, previous approaches?
- The rules presented in Table 1 are similar to those used to fine-tune
SentiStrength in previous research; the authors should explain in detail the
resources that support the assumptions behind their decision to use those
particular rules.
- The terminology used to refer to lexica and methods is inconsistent. In some
cases the name of the method is used (SentiWordNet) and in others the name of
one of the authors of a paper using the method (Thelwall-Lexicon). I suggest
that the authors adopt a consistent terminology, for example referring to
SentiStrength, or citing all methods by the initials of their authors.
- The pie charts in Figure 7 are unnecessary, and the strange vertical scale
makes it very difficult to judge the actual values. A typical bar chart or a
table would serve the purpose of that figure better.
- The 14 positive and negative paradigm words (p. 9) need to be reported, and
the criterion for their choice has to be defended in the text (see the note on
SO-PMI after this list).
- Two positive aspects help the reader form a clearer idea of the results.
First, the inclusion of the SO-PMI benchmark: SentiStrength, being fully
unsupervised, would not use global properties of the tweets in the evaluation
dataset, whereas SO-PMI adds a naive approach that, to some extent, marginally
improves the original performance. Second, the descriptive insights on the
amount of changes explained in section 6.5, and on the role of balance in
section 6.6. This kind of post hoc analysis helps to better understand the
conditions and limitations of the results.
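Returning to the SentiCircle point above: the following toy computation,
assuming our reading of the paper's parameterisation (theta equal to the prior
sentiment in [-1, 1] times pi, r equal to the term's degree of correlation),
illustrates how extremely positive and extremely negative terms collapse onto
nearly the same point:

    import numpy as np

    def senticircle_point(prior, r):
        # Assumed parameterisation: theta scales with the prior sentiment;
        # r encodes the strength of association with the target term.
        theta = prior * np.pi
        return np.array([r * np.cos(theta), r * np.sin(theta)])

    very_positive = senticircle_point(+0.99, r=1.0)  # theta ~ +178 degrees
    very_negative = senticircle_point(-0.99, r=1.0)  # theta ~ -178 degrees
    print(np.linalg.norm(very_positive - very_negative))  # ~0.06, almost identical

A TF-IDF-weighted average of prior scores would keep these two terms at
opposite ends of the scale, which is why an explicit justification of the
polar representation is needed.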
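On the paradigm words: we assume the standard Turney-style SO-PMI formulation,

    SO-PMI(w) = \sum_{p \in P} PMI(w, p) - \sum_{n \in N} PMI(w, n),
    where PMI(a, b) = \log_2 \frac{p(a, b)}{p(a)\, p(b)},

with P and N the sets of positive and negative paradigm words. Since every
score is an offset against these 14 anchors, the criterion for selecting them
directly shapes all SO-PMI results, hence the request to report and defend
them.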
To sum up, the article needs important improvements before its main
conclusions can be considered supported by the results. In particular, the
statistical tests need to be reported in much more detail to conclude that the
models are a significant improvement, and to establish to what extent the
semantic relation and concept data extend the contextual adaptation.