All You Need is "Leet": Evading Hate-speech Detection AI
Sampanna Yashwant Kahu, Naman Ahuja

TL;DR
This paper introduces black-box perturbation techniques that effectively evade state-of-the-art hate speech detection models with minimal semantic change, achieving an 86.8% success rate.
Contribution
It presents novel black-box attack methods to bypass hate speech detection AI with minimal alterations to the original text.
Findings
Successfully evades hate-speech detection in 86.8% of cases
Maintains minimal semantic change in original hate speech
Demonstrates vulnerability of current detection models
Abstract
Social media and online forums are increasingly popular. Unfortunately, these platforms are also used to spread hate speech. In this paper, we design black-box techniques to protect users from hate speech on online platforms by generating perturbations that can fool state-of-the-art deep-learning-based hate-speech detection models, thereby decreasing their efficiency. We also ensure minimal change to the original meaning of the hate speech. Our best perturbation attack evades hate-speech detection for 86.8% of hateful text.
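To make the idea of a black-box perturbation concrete, the sketch below applies a leet-style character substitution (suggested by the paper's title) and measures how often a detector stops flagging the perturbed text. Everything in it is an assumption for illustration: the substitution table `LEET_MAP`, the keyword-based `toy_detector` standing in for a deep learning model, the placeholder sample texts, and the definition of `evasion_rate` are hypothetical and are not taken from the paper.

```python
# Hypothetical leet table; the paper's actual perturbations are not specified here.
LEET_MAP = {"a": "4", "e": "3", "i": "1", "o": "0", "s": "5", "t": "7"}


def to_leet(text: str) -> str:
    """Replace each mapped letter (case-insensitive) with a look-alike digit."""
    return "".join(LEET_MAP.get(ch.lower(), ch) for ch in text)


def toy_detector(text: str, blocklist=("badword", "insult")) -> bool:
    """Stand-in for a black-box classifier: flags text containing a blocklisted term.

    A real attack would only observe the model's output, as this function allows.
    """
    lowered = text.lower()
    return any(term in lowered for term in blocklist)


def evasion_rate(texts, perturb, detector) -> float:
    """Share of originally flagged texts that are no longer flagged after perturbation."""
    flagged = [t for t in texts if detector(t)]
    if not flagged:
        return 0.0
    evaded = sum(1 for t in flagged if not detector(perturb(t)))
    return evaded / len(flagged)


# Placeholder inputs; only the first, second, and fourth are flagged by the toy detector.
samples = ["you are a badword", "what an insult", "hello there", "BADWORD!!"]
rate = evasion_rate(samples, to_leet, toy_detector)
print(f"evasion rate against the toy detector: {rate:.1%}")
```

A keyword matcher is far easier to fool than the models the paper targets, so the rate printed here says nothing about the reported 86.8%; the sketch only shows how such a rate could be computed from black-box queries.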
Taxonomy
Topics: Hate Speech and Cyberbullying Detection · Spam and Phishing Detection · Misinformation and Its Impacts
