A study of text representations in Hate Speech Detection
Chrysoula Themeli, George Giannakopoulos, Nikiforos Pittaras

TL;DR
This paper evaluates several text representation methods, paired with multiple classifiers, for hate speech detection on social media. Simple hate-keyword frequency features (BoW) work best, followed by pre-trained GloVe word embeddings and N-gram graphs.
Contribution
It systematically compares multiple text representations and classifiers, highlighting the effectiveness of simple features like BoW and N-gram graphs for hate speech detection.
Findings
BoW features outperform other representations.
Pre-trained word embeddings (GloVe) are the next most effective representation.
Combining representations with Logistic Regression yields top performance.
Abstract
The pervasiveness of the Internet and social media has enabled the rapid and anonymous spread of Hate Speech content on microblogging platforms such as Twitter. Current EU and US legislation against hateful language, in conjunction with the large amount of data produced on these platforms, has made automatic tools a necessary component of the Hate Speech detection task and pipeline. In this study, we examine the performance of several diverse text representation techniques, paired with multiple classification algorithms, on the automatic Hate Speech detection and abusive language discrimination task. We perform an experimental evaluation on binary and multiclass datasets, paired with significance testing. Our results show that simple hate-keyword frequency features (BoW) work best, followed by pre-trained word embeddings (GloVe) as well as N-gram graphs (NGGs): a graph-based…
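To make the "hate-keyword frequency features (BoW)" idea in the abstract concrete, here is a minimal sketch of how such a feature vector might be built. It is an illustration under stated assumptions, not the authors' implementation: the keyword list `KEYWORDS`, the regex tokenizer, and the function names are all hypothetical, and the paper's actual lexicon and preprocessing are not given in this summary.

```python
import re
from collections import Counter

# Hypothetical keyword list, for illustration only.
# It is NOT the lexicon used in the paper.
KEYWORDS = ["idiot", "stupid", "trash"]


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens.

    Assumption: a simple regex tokenizer; the paper's preprocessing may differ.
    """
    return re.findall(r"[a-z']+", text.lower())


def keyword_frequency_vector(text, keywords=KEYWORDS):
    """Return one occurrence count per keyword, in the order of `keywords`."""
    counts = Counter(tokenize(text))
    return [counts[k] for k in keywords]


tweets = [
    "You are an idiot, a stupid idiot",
    "What a lovely day",
]

# One fixed-length count vector per tweet.
features = [keyword_frequency_vector(t) for t in tweets]
```

Each tweet becomes a fixed-length vector of keyword counts, which could then be passed to any standard classifier, for example the Logistic Regression listed under Methods.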
Taxonomy
Topics: Hate Speech and Cyberbullying Detection · Internet Traffic Analysis and Secure E-voting
Methods: Logistic Regression
