Hate Speech Detection: A Solved Problem? The Challenging Case of Long Tail on Twitter

Tracking #: 1832-3045

Authors: 
Ziqi Zhang
Lei Luo

Responsible editor: 
Guest Editors Semantic Deep Learning 2018

Submission type: 
Full Paper
Abstract: 
In recent years, the increasing propagation of hate speech on social media and the urgent need for effective counter-measures have drawn significant investment from governments, companies, and empirical research. Despite a large number of emerging scientific studies to address the problem, the performance of existing automated methods at identifying specific types of hate speech - as opposed to identifying non-hate - is still very unsatisfactory, and the reasons behind this are poorly understood. This work undertakes the first in-depth analysis of this problem and shows that the very challenging nature of identifying hate speech on social media is largely due to the extremely unbalanced presence of real hateful content in typical datasets, and the lack of unique, discriminative features in such content, both of which cause it to reside in the ‘long tail’ of a dataset that is difficult to discover. To address this issue, we propose novel Deep Neural Network structures serving as effective feature extractors, and explore the usage of background information in the form of different word embeddings pre-trained on unlabelled corpora. We empirically evaluate our methods on the largest collection of hate speech datasets based on Twitter, and show that our methods can significantly outperform the state of the art, obtaining a maximum improvement of between 4 and 16 percentage points (macro-average F1) depending on the dataset.
Tags: 
Reviewed

Decision/Status: 
Major Revision

Solicited Reviews:
Review #1
By Martin Riedl submitted on 15/Apr/2018
Suggestion:
Major Revision
Review Comment:

Summary:

The authors present an article on the task of hate speech detection. They present an analysis and results for publicly available English hate speech datasets that are based on Tweets. Additionally, they introduce a new dataset that was generated similarly to previously generated ones. In a first data analysis the authors show the ratio of unique words for each category (hate (e.g. racism, sexism, both) and non-hate). The authors claim that categories with higher ratios of unique words are easier to detect. In addition, they present plots analyzing how often tweets use the unique vocabulary of their category. Here it seems that non-hate tweets tend to use their category's vocabulary more than the hate-related ones do. The authors claim that this is one of the reasons why the detection of hate speech is complex.
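The per-category vocabulary statistic described above could be computed along these lines (a sketch of the idea only; the function name, tokenisation, and exact normalisation are assumptions, not taken from the paper):

```python
from collections import defaultdict

def unique_word_ratios(labelled_tweets):
    """For each label, the share of its vocabulary that appears in NO
    other label's tweets (hypothetical reconstruction of the statistic)."""
    vocab = defaultdict(set)
    for label, text in labelled_tweets:
        vocab[label].update(text.lower().split())
    ratios = {}
    for label, words in vocab.items():
        # Union of all other categories' vocabularies.
        others = set().union(*(v for l, v in vocab.items() if l != label))
        ratios[label] = len(words - others) / len(words)
    return ratios

tweets = [("sexism", "women are inferior"),
          ("non-hate", "women are awesome"),
          ("non-hate", "what an awesome sunny day")]
ratios = unique_word_ratios(tweets)
# sexism: 1/3 (only "inferior" is unique); non-hate: 5/7
```

On this toy sample the hate category reuses most of its words from non-hate text, which is the pattern the review summarises.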

Afterwards the authors introduce two neural network extensions, both based on a base CNN. The first extension adds a skipped CNN, and the second one appends a GRU layer to the base CNN. Furthermore, they test various pre-computed word embeddings trained on Google Books, Web text and Twitter. They show that the best vocabulary coverage is reached with the Twitter embeddings, and that combining all embeddings results in the lowest OOV rates. During their evaluation they also show results when training the embeddings from scratch, and when combining all existing embeddings; in the combination setting they use the non-Twitter embeddings as back-off. In their evaluation both extended CNN models outperform the base CNN, with the sCNN extension performing better than the GRU-based one. Using different embeddings, the authors could not find any clear trend and report that the choice of embeddings is task dependent.

In addition to showing the overall results, the authors present results for the different categories. Again, the introduced extensions of the base CNN outperform the baseline. In addition, they show the average performance of the baselines and the extensions with different embeddings. Furthermore, they show comparisons of the average F1 scores between hate classes. Here, the results heavily correlate with the average F1 scores across both hate and non-hate classes.

Next, they compare their results to already published methods. In contrast to the previous results, they report micro-average F1 here. The differences between the base CNN and the extensions are marginal; however, the results outperform the previously reported ones for most datasets. In addition, the authors give some examples of common errors.

Contribution & Strength:

- The authors present two extensions of a base CNN which they use for the task of English hate speech classification based on Twitter

- The authors show the impact of using different word embeddings for the task

- The evaluation is performed on three datasets. For one of the datasets they report results using differently generated labels (e.g. amateur annotators, specialists, etc.)

- the authors introduce a new dataset

- Vocabulary-based pre-analysis of the data

- source code seems to be available

Weakness:

- the paper requires careful proofreading. I have added some minor comments referring only to the introduction.

- the formatting should be corrected so that text does not extend outside the text columns.

- The presentation of the results (especially Table 4, Fig. 6-8) is not clear to me and should be simplified

- some clearer analysis regarding the different word embeddings should have been provided. It seems odd that no clear trend is observed, especially as the task does not change

- I expect that a paper is understandable by reading the text without reading the captions of figures and tables. However, many figures (1,6,7,8,9) are hardly understandable when reading solely the text.

- The data analysis is nice but very general and short.

- There should be some further analysis of why the improvements are not as significant for the micro F1 scores as for the macro F1 scores.
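The gap between the two averages that this comment raises can be illustrated with a toy computation (hypothetical confusion counts, not numbers from the paper):

```python
def f1(tp, fp, fn):
    # F1 = 2*TP / (2*TP + FP + FN)
    return 2 * tp / (2 * tp + fp + fn)

# Made-up counts for an imbalanced set: 90 non-hate vs. 10 hate tweets.
# non-hate: TP=88, FP=7, FN=2;  hate: TP=3, FP=2, FN=7.
macro = (f1(88, 7, 2) + f1(3, 2, 7)) / 2   # ~0.68, dragged down by hate
micro = f1(88 + 3, 7 + 2, 2 + 7)           # 0.91, dominated by non-hate
```

Because micro-average F1 weights every tweet equally, gains on the small hate classes barely move it, while macro-average F1 reflects them directly; this is one plausible reason the improvements look smaller under micro F1.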

Comments/Questions:

- In Section 5.6 the authors claim to have looked at 200 misclassified tweets. As the authors also describe error classes, it would have been interesting to annotate these error classes directly in order to present some numbers about the common errors.

- the authors should note that they concentrate on hate speech in English Tweets. Otherwise, datasets for other languages (e.g. https://github.com/ialfina/id-hatespeech-detection) should also be considered in order to claim that they use "all publicly available datasets".

- I think the analysis shown in Table 2 is interesting to some extent. However, the authors do not explain the reason for this effect. To me, it seems that this uniqueness measure correlates with the ratio of tweets in each category (e.g. the experts annotated fewer tweets as hate, and thus the non-hate vocabulary is larger than for the amateur annotators). As the highest scores are achieved for the majority class, it is obvious that this class is easier to detect.

- Section 3.2: Repeating the objectives 1) and 2) when analyzing them would improve the understanding of the paper.

- Why are hashtags split? I would assume that such hashtags serve as a great feature for detecting hate-related tweets, as the authors themselves demonstrated when generating their dataset.

- Why was the abuse class removed from the DT dataset? And how can the results be compared to the DT baseline if the dataset has been changed?

- Figure 1: Reading solely the text, the plots do not become clear. For example, the x- and y-axes are not labeled. Furthermore, the legend should be placed more prominently (e.g. at the top).

- What is the effect of the back-off strategy for obtaining embeddings from the pre-computed ones? It would be interesting to compare the results to using random embeddings for OOV words.

- It would furthermore have been interesting to not only use the back-off strategy for the embeddings, but to stack the embeddings together and use zeros/random numbers if the word is unknown in one of the embeddings. I would assume that the current strategy does not work well, especially as embeddings from a different model lie in a different space.

- Section 5.1: How were the POS tags extracted on the Tweets?

- I understand that the paper presents a lot of results. However, the presentation of the results in Figures 6-8 is hard to understand; even after reading the caption of Fig. 6 it did not become entirely clear. As far as I can tell, each bar represents the following value: Base+sCNN - (Base CNN). I would advise presenting the results as numbers, which would also save a lot of space.

- For the WZ dataset with expert annotations, the sCNN modification achieves the highest improvements, and here the Twitter-based embeddings also perform best. It would be interesting to see whether using that classifier on the other datasets yields good performance.

- The differing performance of the various embeddings is interesting, especially as the results vary considerably across the variations of the WZ dataset. Some data analysis would be interesting here, to understand why different embeddings perform better or worse although the task is very similar.

- a nice article that lists all available datasets and results: https://repositorio-aberto.up.pt/handle/10216/106028

- SVM results are missing from Table 5. These would have been more interesting than showing results for all embedding types.
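The back-off versus stacking strategies raised in the comments above could be sketched as follows (a toy illustration with made-up 3-dimensional vectors; the paper's actual lookup code and dimensionality will differ):

```python
# Tiny made-up "embeddings" standing in for pre-trained lookup tables.
twitter_emb = {"hate": [0.9, 0.1, 0.0], "speech": [0.2, 0.8, 0.1]}
web_emb     = {"speech": [0.5, 0.5, 0.0], "tail":  [0.0, 0.3, 0.7]}

def backoff_lookup(word, primary, fallback, dim=3):
    # Back-off: take the vector from the primary embedding if present,
    # otherwise fall back to the secondary one, else zeros.
    if word in primary:
        return primary[word]
    if word in fallback:
        return fallback[word]
    return [0.0] * dim

def stacked_lookup(word, embeddings, dim=3):
    # Stacking: concatenate the vectors from all embeddings,
    # filling zeros where the word is OOV in a given embedding.
    vec = []
    for emb in embeddings:
        vec.extend(emb.get(word, [0.0] * dim))
    return vec

backoff_lookup("tail", twitter_emb, web_emb)    # falls back to web_emb
stacked_lookup("tail", [twitter_emb, web_emb])  # zeros for the Twitter part
```

Stacking keeps each source in its own block of dimensions, so vectors from different spaces are never substituted for one another, which is the concern with the back-off scheme.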

Minor:

- Section 2.1: the example Tweets do not fit into the columns. The same applies to Sections 3.3 and 4.2.3

- introduction: "hate based" -> hate-based

- Introduction: "speech - eventually leading to hate crime - effortless" -> "speech -- eventually leading to hate crime -- effortless"

- Introduction: "This correlates to record spikes..." -> sentence ends strange and might improve by re-phrasing.

- EEA might not be known to every reader

- Introduction: "This work studies automatic" -> might want to re-phrase as for me it reads like a garden path sentence

- "that help address it" -> "that help to address it"

- "we undertake research" -> rephrase

- " sets new benchmark for future research" -> sets a new benchmark...

- Related work: Word generalization includes examples ... -> sentence too long and hard to understand

- 2.3. Dataset: "It is widely recognized" -> not sure what the authors are trying to express.

- Fig. 6&7: y-axis not labeled

- Figs. 6&9 and 7&9 have the same title; however, they present different results

- Table 5: check that all numbers have 3 digits.

Review #2
Anonymous submitted on 23/Apr/2018
Suggestion:
Minor Revision
Review Comment:

This paper proposes a deep learning architecture for the task of hate speech detection. The authors also provide an insightful study of current datasets for hate speech detection. The experiments on these datasets show that the proposed method achieves promising results compared to state-of-the-art methods.

In general, the paper is well written and structured. The approach is sound, and the experiments are well-designed and well-executed. However, the paper has some shortcomings. First, the intuition behind the proposed neural network is lacking. Given that there are many other neural networks in the literature for hate speech detection, it is not clear what is superior about the proposed architectures, or why these particular layers are used. Second, the contribution of "exploring the effect of using background information" is weak. In fact, testing different embeddings should not be considered a contribution but simple network tuning. Finally, the comparison to other state-of-the-art methods is not quite convincing. The authors should compare the results based on micro F1 rather than macro F1 due to the imbalance of classes. In addition, the authors mention that they reimplemented the work of [12], but the performance on two classes seems low compared to the original paper on three classes.

Review #3
By Armand Vilalta submitted on 13/May/2018
Suggestion:
Major Revision
Review Comment:

Summary
The paper studies the problem of hate speech detection on Twitter. It identifies as the main challenges the imbalance between hate and non-hate classes in current publicly available datasets and the lack of unique features in the data. First, a novel dataset for the problem is provided, with detailed explanations of its construction. Next, the study computes some ad-hoc statistics on the proposed dataset and on other publicly available ones to characterize the data. Several commonly used word embeddings are proposed and tested. Finally, two different Deep Neural Networks (DNNs) are proposed to classify the tweets. The results include different combinations of the proposed options, claiming state-of-the-art results on several datasets.

General impression
My general impression of this paper is that it makes a general, broad study of the topic without focusing on a specific improvement over the existing literature. Most of the content of the paper has already been studied, and in my opinion it lacks a significant novel contribution. I consider that the authors try to fit a quite broad study into the paper, which may be too much for a single paper. I would recommend focusing on the really novel contributions rather than trying to fit everything in, which leads to vague, imprecise claims.

Writing and presentation
The paper is in general well written, although it is sometimes repetitive. For instance, the words “long tail” appear 16 times in the paper, repeating the same idea in different parts of it. The figures are too complicated because a lot of information is fitted into a single figure.

Per sections review
1. Introduction
It states:
This work explores two research questions concerning this problem: 1) what makes the detection of hate speech difficult, and 2) what are the methods that help address it.
These objectives make me think this is a review paper rather than a full paper, which is how it was submitted.

Following, we find the answer to the first question: we show that hate speech exhibits a ‘long tail’ pattern compared to non-hate content due to their lack of unique, discriminative features, and this makes them very challenging to identify.
Other works have hypothesized and studied more detailed reasons that make this problem especially difficult. The long-tail pattern is found in many problems with important class imbalance (e.g. medical diagnosis, anomaly detection, ...), so it is not particular to the studied problem and, in my opinion, is not the biggest challenge to face. See for instance the review by Schmidt (ref [38] in the paper), where the first challenge listed is the cultural background dependency. In my opinion this is not a deep study of the causes of such difficulty.

To address the second question, the authors propose the use of deep neural networks and pre-trained word embeddings. This solution is not new at all: the use of recurrent neural networks has become quite common in NLP tasks in general, and has been applied to this very problem in particular by Mehdad (ref [28] in the paper).

I would ask authors to explain more clearly the contributions of the paper.

2. Related Work

It is good that there is a part on terminology. The subtle factors that differentiate hate speech are important to know in order to tackle the problem.
The rest of the related work is in my opinion too long. It reads like a summary of a review paper, and the relation to the present work is not always clear.
Sections 4.2.1, 4.2.2, 4.2.3 and 4.3 each contain a “comparison with state of the art” paragraph, which is effectively related work.
I consider that this section should be rewritten, keeping only the related work, so that it is not necessary to add a “comparison with state of the art” paragraph in the later sections.
The remark in section 2.5 is completely unnecessary in this section.

3. Dataset Analysis – the case of Long Tail

About the dataset RM created by the authors, I would like to ask: what does this dataset add that was not present in previous datasets? What is its value?
Section 5.3 (General Results) explains that for several datasets the class “both” has very few instances, which obscures the results. Have you considered removing it?

About section 3.2 Dataset Analysis: although the term “bag of words” does not appear anywhere, if I understood correctly, all analysis in this section is under this hypothesis. Word relations within the sentence and character-level information are not taken into account, only the presence or absence of a complete word.
I wonder why. In [38, 28] it seems clear that character-level information (character n-grams) should be preferred for various reasons.

The findings are not especially surprising: you can hate or not using the same words.
The tables and figures of this section are especially complicated to understand, and it is hard to extract a clear picture of what is happening.

4. Methodology

I wonder whether using word embeddings is a better choice than using character embeddings; this choice is not reasoned.

About the last word embedding, e.all: I consider the way it is created, taking words from different embeddings, to be a mistake. Each embedding keeps an internal structure, placing related concepts close together and different concepts far apart. Imagine we have two identical but mirror-symmetric embeddings: if word “a” in the first embedding is (1,2,3), in the second embedding it is (-1,-2,-3). If we take word “a” from the first embedding and word “b”, semantically very similar to “a”, from the second, they will be distant in the resulting embedding, while they are not in reality nor in either of the original embeddings.
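The mirror-image scenario sketched in this comment can be checked numerically (a minimal sketch; the toy vectors extend the (1,2,3)/(-1,-2,-3) example above):

```python
import math

def cosine(u, v):
    # Cosine similarity between two vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Two "identical but mirrored" toy embeddings: same internal structure,
# every vector negated in the second space.
emb1 = {"a": [1.0, 2.0, 3.0], "b": [1.1, 2.0, 3.0]}   # a and b are close
emb2 = {w: [-x for x in v] for w, v in emb1.items()}

within = cosine(emb1["a"], emb1["b"])  # close to +1: same space
mixed  = cosine(emb1["a"], emb2["b"])  # close to -1: mixed spaces
```

Within either space the two words are nearly identical, yet pulling them from different spaces makes them almost maximally distant, which is precisely the objection to a back-off construction of e.all.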

If the results using different word embeddings are similar, I would prefer to stick to one embedding for the rest of the experiments, to reduce the number of factors involved and to simplify the presentation and interpretation of the results.

The DNN structures chosen are well known for text processing, with some modifications applied. For instance, it is typical to use the GRU's last time-step hidden unit activations as a representation of the sentence; here the authors chose global max pooling instead. Why? Is it better than the “regular” option?
Many of the choices in this section are not justified. Admittedly, this is quite common for DNN architectures, but at least some intuition about the rationale of these choices is expected, especially if they are small modifications of well-known models. Ablation studies are also useful to detect the important parts of the network and to better understand why it works. In any case, a comparison with state-of-the-art models can validate the choices.

5. Experiment

Evaluation:
It is clear to me that hate speech is a problem. But what actually is the problem? To detect all of it, in which case we should look at the recall of the hate class? Or to detect only it, in which case we should look at the precision of the hate class?
I would really appreciate a study in this direction, starting by defining the metrics most suitable for the problem and the rationale behind them.
As in many studies, we have lots of different metrics, from which we compute further metrics (e.g. F1-score), which we finally aggregate (e.g. macro average), ending up losing their meaning.

As said before, there is a lot of information presented in this paper, and this section is for obvious reasons the most affected. I strongly recommend reducing the amount of information presented in order to simplify its presentation.

Figures 6-9 are a trap in which it is too easy to get lost. They present improvements of one method compared with other, different methods on different datasets and for different word embeddings. Difficult to understand? Add that the bar length represents the difference with a reference that is different for each bar, so a longer bar can actually mean nothing at all when comparing methods and options.
I do not think that these differences are more important than absolute performance; at the very least, they could be read off as bar-length differences in a “regular” bar plot. If the information presented here is reduced, a simpler presentation could be obtained. Important results can always be highlighted in the text.