Beautiful Images, Toxic Words: Understanding and Addressing Offensive Text in Generated Images
Aditya Kumar, Tom Blanchard, Adam Dziedzic, Franziska Boenisch

TL;DR
This paper shows that state-of-the-art diffusion models can generate offensive NSFW text within images, and proposes a targeted fine-tuning method, along with a new benchmark, to address this issue.
Contribution
The paper introduces a novel fine-tuning strategy that targets only the text-generation layers and releases ToxicBench, a comprehensive benchmark for evaluating NSFW text in generated images.
Findings
All tested diffusion models (DMs) are vulnerable to NSFW text generation.
Existing mitigation techniques fail to prevent harmful text without degrading benign content.
The proposed fine-tuning approach effectively reduces NSFW text while maintaining image quality.
Abstract
State-of-the-art Diffusion Models (DMs) produce highly realistic images. While prior work has successfully mitigated Not Safe For Work (NSFW) content in the visual domain, we identify a novel threat: the generation of NSFW text embedded within images. This includes offensive language, such as insults, racial slurs, and sexually explicit terms, posing significant risks to users. We show that all state-of-the-art DMs (e.g., SD3, SDXL, Flux, DeepFloyd IF) are vulnerable to this issue. Through extensive experiments, we demonstrate that existing mitigation techniques, effective for visual content, fail to prevent harmful text generation while substantially degrading benign text generation. As an initial step toward addressing this threat, we introduce a novel fine-tuning strategy that targets only the text-generation layers in DMs. To this end, we construct a safety fine-tuning dataset by…
Taxonomy
Topics: Hate Speech and Cyberbullying Detection
Methods: Diffusion
