Review Comment:
Summary:
The authors present an article on the task of hate speech detection. They present an analysis and results for publicly available English hate speech datasets based on tweets. Additionally, they introduce a new dataset that was generated similarly to the existing ones. In a first data analysis, the authors show the ratio of unique words for each category (hate, e.g. racism, sexism, both, and non-hate). The authors claim that categories with higher ratios of unique words are easier to detect. In addition, they present plots analyzing how often tweets use the unique vocabulary of their category. Here it appears that non-hate tweets tend to use their category's vocabulary more often than hate-related tweets do. The authors claim that this is one of the reasons why the detection of hate speech is complex.
Afterwards the authors introduce two neural network extensions, both based on a base CNN. The first extension adds a skipped CNN (sCNN), and the second appends a GRU layer to the base CNN. Furthermore, they test various pre-computed word embeddings trained on Google Books, web text, and Twitter. They show that the best vocabulary coverage is reached with the Twitter embeddings, and that combining all embeddings results in the lowest OOV rates. In their evaluation they also report results when training the embeddings from scratch. In addition, they show results when combining all existing embeddings; in this combination setting they use the non-Twitter embeddings as back-off. In the evaluation, both extended CNN models outperform the base CNN, with the sCNN extension performing better than the GRU-based one. Across the different embeddings, the authors could not find any clear trend and report that the choice of embeddings is task dependent.
In addition to the overall results, the authors present results for the individual categories. Again, the introduced extensions of the base CNN outperform the baseline. They also show the average performance of the baselines and the extensions with different embeddings, and compare average F1 scores between the hate classes. Here, the results correlate strongly with the average F1 scores computed over both hate and non-hate classes.
Next, they compare their results to previously published methods. In contrast to the earlier results, they report micro-averaged F1 here. The differences between the base CNN and the extensions are marginal; however, the results outperform the previously reported results for most datasets. In addition, the authors give some examples of common errors.
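As I understand the combination setting described above, the back-off lookup works roughly as follows. The sketch below is my own illustration of that strategy; the table names, contents, and the 300-dimensional vectors are assumptions, not the authors' code:

```python
import numpy as np

# My own sketch of the back-off strategy: query the Twitter embeddings
# first, then fall back to the other pre-trained sets, and finally to a
# random vector for true OOV words. The tables are tiny stand-ins, not
# the real embedding files.
DIM = 300
rng = np.random.default_rng(0)

twitter_vecs = {"hate": rng.normal(size=DIM)}    # stand-in for Twitter vectors
web_vecs = {"speech": rng.normal(size=DIM)}      # stand-in for web-text vectors
books_vecs = {"racism": rng.normal(size=DIM)}    # stand-in for Google Books vectors

def lookup(word):
    """Return the first matching vector (Twitter first), else a random one."""
    for table in (twitter_vecs, web_vecs, books_vecs):
        if word in table:
            return table[word]
    return rng.normal(size=DIM)  # true OOV: random initialisation

print(lookup("speech").shape)  # resolved via the web-text table: (300,)
```

A question this sketch raises (see my comments below) is whether vectors taken from different models are mutually compatible, since each model defines its own vector space.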
Contribution & Strength:
- The authors present two extensions of a base CNN, which they use for the task of English hate speech classification on Twitter data
- The authors show the impact of using different word embeddings for the task
- The evaluation is performed on three datasets. For one of the datasets they report results using differently generated labels (e.g. amateur annotators vs. specialists)
- The authors introduce a new dataset
- Vocabulary-based pre-analysis of the data
- Source code appears to be available
Weakness:
- The paper requires careful proofreading. I have added some minor comments referring only to the introduction.
- The formatting should be corrected so that text does not extend beyond the column boundaries.
- The presentation of the results (especially Table 4, Fig. 6-8) is not clear to me and should be simplified
- A clearer analysis of the different word embeddings should have been provided. It seems odd that no clear trend is observed, especially as the task does not change
- I expect a paper to be understandable from the text alone, without reading the captions of figures and tables. However, many figures (1, 6, 7, 8, 9) are hardly understandable when reading solely the text.
- The data analysis is nice but very general and short.
- There should be some further analysis of why the improvements are not as pronounced for the micro F1 scores as for the macro F1 scores.
Comments/Questions:
- In Section 5.6 the authors state that they inspected 200 misclassified tweets. As they also describe error classes, it would have been interesting to annotate these tweets with the error classes directly in order to present some numbers about the common errors.
- The authors should state explicitly that they concentrate on hate speech in English tweets. Otherwise, datasets for other languages (e.g. https://github.com/ialfina/id-hatespeech-detection) would also have to be considered in order to claim that they use "all publicly available datasets".
- I think the analysis shown in Table 2 is interesting to some extent. However, the authors do not explain the reason for this effect. To me, it seems that this uniqueness measure correlates with the ratio of tweets in each category (e.g. the experts annotated fewer tweets as hate, so the non-hate vocabulary is larger than for the amateurs). As the highest scores are achieved for the majority class, it is unsurprising that this class is easier to detect.
- Section 3.2: Restating objectives 1) and 2) when analyzing them would improve the readability of the paper.
- Why are hashtags split? I would assume that such hashtags serve as a great feature for the detection of hate related tweets, which was also demonstrated by the authors when generating their dataset.
- Why was the abuse class removed from the DT dataset? And how can the results be compared to the DT baseline if the dataset has been changed?
- Figure 1: Reading solely the text, the plots do not become clear. For example, the x- and y-axis are not labeled. Furthermore, the legend should be placed more prominent (e.g. on the top).
- What is the effect of the back-off strategy for obtaining embeddings from the precomputed ones? It would be interesting to compare the results to using random embeddings for OOV words.
- It would furthermore have been interesting not only to use the back-off strategy but also to concatenate the embeddings, filling in zeros or random numbers when a word is unknown in one of the spaces. I would assume that the current strategy does not work well, especially since the embeddings from different models lie in different spaces.
- Section 5.1: How were the POS tags extracted on the Tweets?
- I understand that the paper presents a large number of results. However, the presentation in Figures 6-8 is hard to follow; even after reading the caption of Fig. 6 it did not become entirely clear. As far as I can tell, each bar represents the value (Base+sCNN) minus (Base CNN). I would advise presenting the results as numbers, which would also save a lot of space.
- For the WZ dataset with expert annotations, the sCNN modification achieves the highest improvements, and the Twitter-based embeddings perform best there. It would be interesting to see whether this classifier also yields good performance on the other datasets.
- The differing performance of the various embeddings is interesting, especially as the results vary considerably across the variations of the WZ dataset. Some data analysis would help to understand why different embeddings perform better or worse although the task is very similar.
- A nice article that lists all available datasets and results: https://repositorio-aberto.up.pt/handle/10216/106028
- SVM results are missing from Table 5. These would have been more interesting than results for all embedding types.
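The concatenation alternative I suggest above could look like the sketch below. The two tables, their dimensions (100 and 200), and the zero-filling choice are purely hypothetical, meant only to make the suggestion concrete:

```python
import numpy as np

# Sketch of the stacking alternative: instead of a back-off, concatenate
# the vectors from all embedding spaces and zero-fill the slot of any
# space in which the word is unknown.
rng = np.random.default_rng(1)
tables = [
    {"hate": rng.normal(size=100)},                                  # e.g. Twitter space
    {"hate": rng.normal(size=200), "speech": rng.normal(size=200)},  # e.g. web-text space
]

def stacked_vector(word):
    """Concatenate per-space vectors, using zeros where the word is unknown."""
    parts = []
    for table in tables:
        dim = len(next(iter(table.values())))  # dimensionality of this space
        parts.append(table.get(word, np.zeros(dim)))
    return np.concatenate(parts)

print(stacked_vector("speech").shape)  # (300,), with zeros in the Twitter slot
```

This keeps each model's vectors in their own coordinate slots rather than mixing vectors from different spaces into the same dimensions, which is my main concern about the back-off strategy.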
Minor:
- Section 2.1: the example Tweets do not fit into the columns. The same applies to Section 3.3 and 4.2.3
- introduction: "hate based" -> hate-based
- Introduction: "speech - eventually leading to hate crime - effortless" -> "speech -- eventually leading to hate crime -- effortless"
- Introduction: "This correlates to record spikes..." -> sentence ends strange and might improve by re-phrasing.
- EEA might not be known to every reader
- Introduction: "This work studies automatic" -> might want to re-phrase as for me it reads like a garden path sentence
- "that help address it" -> "that help to address it"
- "we undertake research" -> rephrase
- " sets new benchmark for future research" -> sets a new benchmark...
- Related work: Word generalization includes examples ... -> sentence too long and hard to understand
- 2.3. Dataset: "It is widely recognized" -> not sure what the authors try to express.
- Fig. 6 & 7: y-axis not labeled
- Fig 6&9 and 7&9 have the same title, however, they present different results
- Table 5: check that all numbers have 3 digits.