All You Need is "Leet": Evading Hate-speech Detection AI

Sampanna Yashwant Kahu; Naman Ahuja

arXiv:2505.16263·cs.CR·May 23, 2025

TL;DR

This paper introduces black-box perturbation techniques that evade state-of-the-art deep-learning-based hate-speech detection models while changing the meaning of the original text as little as possible. The best attack evades detection for 86.8% of hateful text.

Contribution

It presents novel black-box attack methods to bypass hate speech detection AI with minimal alterations to the original text.

Findings

01 The best perturbation attack evades hate-speech detection for 86.8% of hateful text

02 The perturbations make only a minimal change to the meaning of the original text

03 Current deep-learning-based detection models are vulnerable to black-box perturbations

Abstract

Social media and online forums are becoming increasingly popular. Unfortunately, these platforms are being used to spread hate speech. In this paper, we design black-box techniques to protect users from hate speech on online platforms by generating perturbations that can fool state-of-the-art deep-learning-based hate-speech detection models, thereby decreasing their efficiency. We also ensure that the original meaning of the hate speech changes only minimally. Our best perturbation attack evades hate-speech detection for 86.8% of hateful text.
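To make the headline figure concrete, here is a minimal sketch of how an evasion rate of this kind could be computed in a black-box setting, where only the detector's yes/no output is observed. Everything in it is an assumption made for illustration: the character map `LEET_MAP`, the helper names `to_leet` and `evasion_rate`, the placeholder-token `toy_detector`, and the choice of denominator (texts the detector flags before perturbation). None of it is taken from the paper or its repository.

```python
from typing import Callable, Iterable

# Hypothetical look-alike substitutions; the paper's actual perturbations
# are not described in this listing.
LEET_MAP = {"a": "4", "e": "3", "i": "1", "o": "0", "s": "5"}


def to_leet(text: str) -> str:
    """Replace each mapped letter (case-insensitively) with a look-alike digit."""
    return "".join(LEET_MAP.get(ch.lower(), ch) for ch in text)


def evasion_rate(
    texts: Iterable[str],
    detector: Callable[[str], bool],
    perturb: Callable[[str], str],
) -> float:
    """Fraction of initially flagged texts that are no longer flagged
    after perturbation. Only the detector's boolean output is used,
    which is what makes the setting black-box."""
    flagged = [t for t in texts if detector(t)]
    if not flagged:
        return 0.0
    evaded = sum(1 for t in flagged if not detector(perturb(t)))
    return evaded / len(flagged)


def toy_detector(text: str) -> bool:
    """Stand-in for a real model: flags text containing a placeholder token."""
    return "badword" in text.lower()


samples = ["this is a badword", "BADWORD again", "harmless text"]
rate = evasion_rate(samples, toy_detector, to_leet)
```

Under this assumed definition, a reported 86.8% would mean that roughly 868 of every 1,000 initially detected texts pass undetected after perturbation. A real evaluation would replace `toy_detector` with queries to the model under test and would also need a separate check that the perturbed text keeps its meaning.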

Code & Models

Repositories

sampannakahu/all_you_need_is_leet
none · Official

Taxonomy

Topics: Hate Speech and Cyberbullying Detection · Spam and Phishing Detection · Misinformation and Its Impacts