All You Need is "Love": Evading Hate-speech Detection
Tommi Gröndahl, Luca Pajola, Mika Juuti, Mauro Conti, N. Asokan

TL;DR
This paper demonstrates that current hate speech detection models perform well only on the type of data they were trained on and are brittle against simple adversarial manipulations, so robust detection requires more than architecture improvements.
Contribution
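The paper reproduces seven state-of-the-art hate speech detection models, shows that they are brittle against adversarial attacks, and argues that the type of data and the labeling criteria matter more than model architecture. It also finds that character-level features improve robustness.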
Findings
Models perform well only on data similar to training data.
Adversarial attacks that insert typos, change word boundaries, or add innocuous words are highly effective.
Character-level features increase attack resistance.
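
One intuition for the last finding can be sketched in a few lines of Python. This is an illustration only, not the paper's feature pipeline: the helper names (`word_features`, `char_ngrams`, `jaccard`) and the example word are assumptions made for this sketch. It compares how much of a word's representation survives a single typo under word-level versus character n-gram features.

```python
def word_features(text):
    """Word-level features: the set of lowercased whitespace-separated tokens."""
    return set(text.lower().split())

def char_ngrams(text, n=3):
    """Character-level features: the set of all contiguous n-character substrings."""
    s = text.lower()
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def jaccard(a, b):
    """Jaccard similarity between two feature sets (1.0 if both are empty)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

# Hypothetical example: a word and a version with two adjacent letters swapped.
original = "example"
perturbed = "exmaple"

word_overlap = jaccard(word_features(original), word_features(perturbed))
char_overlap = jaccard(char_ngrams(original), char_ngrams(perturbed))
```

Here the misspelled word is a completely different token at the word level, so `word_overlap` is zero, while some character trigrams are shared and `char_overlap` stays above zero. A classifier built on character n-grams therefore still sees part of the original signal after the typo.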
Abstract
With the spread of social networks and their unfortunate use for hate speech, automatic detection of the latter has become a pressing problem. In this paper, we reproduce seven state-of-the-art hate speech detection models from prior work, and show that they perform well only when tested on the same type of data they were trained on. Based on these results, we argue that for successful hate speech detection, model architecture is less important than the type of data and labeling criteria. We further show that all proposed detection techniques are brittle against adversaries who can (automatically) insert typos, change word boundaries or add innocuous words to the original hate speech. A combination of these methods is also effective against Google Perspective -- a cutting-edge solution from industry. Our experiments demonstrate that adversarial training does not completely mitigate the…
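
The three kinds of perturbation named in the abstract (inserting typos, changing word boundaries, adding innocuous words) can be sketched as simple text transformations. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names are hypothetical, the typo model (swapping two adjacent interior characters) is just one possible choice, and using "love" as the innocuous word is inferred from the paper's title.

```python
import random

def insert_typo(word, rng):
    """Swap two adjacent interior characters (one simple, assumed typo model).

    The first and last characters are left in place; words shorter than
    four characters are returned unchanged.
    """
    if len(word) < 4:
        return word
    i = rng.randrange(1, len(word) - 2)
    return word[:i] + word[i + 1] + word[i] + word[i + 2:]

def remove_word_boundaries(text):
    """Change word boundaries by deleting all spaces between words."""
    return text.replace(" ", "")

def append_innocuous_word(text, word="love"):
    """Append an innocuous word to the end of the text."""
    return f"{text} {word}"

rng = random.Random(0)  # fixed seed so the sketch is reproducible
sample = "some offensive sentence"
typo_version = " ".join(insert_typo(w, rng) for w in sample.split())
merged_version = remove_word_boundaries(sample)
padded_version = append_innocuous_word(sample)
```

The abstract notes that a combination of these methods is also effective, which in this sketch would correspond to composing the functions, for example `append_innocuous_word(remove_word_boundaries(sample))`.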
Taxonomy
Topics: Hate Speech and Cyberbullying Detection
