Evading Toxicity Detection with ASCII-art: A Benchmark of Spatial Attacks on Moderation Systems

Sergey Berezin; Reza Farahbakhsh; Noel Crespi

arXiv:2409.18708 · cs.CL · September 25, 2025

TL;DR

This paper presents ASCII-art-based adversarial attacks that exploit vulnerabilities in toxicity detection models, showing that current moderation systems are highly susceptible to visually obfuscated inputs. It also introduces ToxASCII, a benchmark for robustness evaluation.

Contribution

The paper introduces a novel class of ASCII-art-based adversarial attacks, together with a benchmark for evaluating how robust toxicity detection is against spatially structured text.

Findings

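01  The attacks achieve a 100% Attack Success Rate (ASR) across the evaluated large language models and dedicated moderation tools.

02  Current text-only toxicity detection systems are highly vulnerable to ASCII-art obfuscation.

03  ToxASCII provides a benchmark for testing the robustness of toxicity detection systems against visually obfuscated inputs.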

Abstract

We introduce a novel class of adversarial attacks on toxicity detection models that exploit language models' failure to interpret spatially structured text in the form of ASCII art. To evaluate the effectiveness of these attacks, we propose ToxASCII, a benchmark designed to assess the robustness of toxicity detection systems against visually obfuscated inputs. Our attacks achieve a perfect Attack Success Rate (ASR) across a diverse set of state-of-the-art large language models and dedicated moderation tools, revealing a significant vulnerability in current text-only moderation systems.
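
To make the idea of spatially structured text and the ASR metric concrete, here is a minimal Python sketch. Everything in it is an illustrative assumption rather than the authors' method: the three-letter block font, the function names render, keyword_detector, and attack_success_rate, and the substring-matching detector that stands in for a real moderation model. ASR is taken here as the fraction of inputs flagged in plain form that are no longer flagged after the transformation; the paper's exact definition may differ.

```python
# Hypothetical 5-row block font; it covers only the letters used in this demo.
FONT = {
    "A": [" ### ", "#   #", "#####", "#   #", "#   #"],
    "B": ["#### ", "#   #", "#### ", "#   #", "#### "],
    "D": ["#### ", "#   #", "#   #", "#   #", "#### "],
}

# Placeholder "flagged" vocabulary for the toy detector (assumption, not real data).
BLOCKLIST = {"bad", "dab"}


def render(text: str) -> str:
    """Draw each character as a 5x5 glyph, laid out side by side."""
    rows = ["  ".join(FONT[ch][r] for ch in text.upper()) for r in range(5)]
    return "\n".join(rows)


def keyword_detector(text: str) -> bool:
    """Toy stand-in for a moderation model: flags text containing a blocklisted word."""
    lowered = text.lower()
    return any(word in lowered for word in BLOCKLIST)


def attack_success_rate(samples, detector, attack) -> float:
    """Fraction of originally flagged samples that evade the detector after the attack."""
    flagged = [s for s in samples if detector(s)]
    if not flagged:
        return 0.0
    evaded = sum(1 for s in flagged if not detector(attack(s)))
    return evaded / len(flagged)


if __name__ == "__main__":
    print(render("BAD"))
    print(attack_success_rate(["BAD", "DAB"], keyword_detector, render))
```

With this toy detector, both plain words are flagged and neither rendered version is, so the computed rate is 1.0. The rendered output contains only '#' characters, spaces, and newlines, so the word exists only in the two-dimensional layout, which is the property the attacks described above rely on. A real evaluation would replace keyword_detector with the model or moderation tool under test.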

Code & Models

Repositories

Serbernari/ToxASCII (official)

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Advanced Malware Detection Techniques · Digital and Cyber Forensics

Methods: LLaMA