Beautiful Images, Toxic Words: Understanding and Addressing Offensive Text in Generated Images

Aditya Kumar; Tom Blanchard; Adam Dziedzic; Franziska Boenisch

arXiv:2502.05066 · cs.CV · January 16, 2026

TL;DR

This paper shows that state-of-the-art diffusion models can generate offensive NSFW text within images. It proposes a targeted fine-tuning method to mitigate the issue and a new benchmark to evaluate it.

Contribution

The paper introduces a fine-tuning strategy that targets only the text-generation layers of diffusion models, and releases ToxicBench, a benchmark for evaluating NSFW text in generated images.

Findings

01  All tested diffusion models (DMs) are vulnerable to NSFW text generation.

02  Existing mitigation techniques, effective for visual content, fail to prevent harmful text generation and substantially degrade benign text generation.

03  The proposed fine-tuning approach effectively reduces NSFW text while maintaining image quality.

Abstract

State-of-the-art Diffusion Models (DMs) produce highly realistic images. While prior work has successfully mitigated Not Safe For Work (NSFW) content in the visual domain, we identify a novel threat: the generation of NSFW text embedded within images. This includes offensive language, such as insults, racial slurs, and sexually explicit terms, posing significant risks to users. We show that all state-of-the-art DMs (e.g., SD3, SDXL, Flux, DeepFloyd IF) are vulnerable to this issue. Through extensive experiments, we demonstrate that existing mitigation techniques, effective for visual content, fail to prevent harmful text generation while substantially degrading benign text generation. As an initial step toward addressing this threat, we introduce a novel fine-tuning strategy that targets only the text-generation layers in DMs. Therefore, we construct a safety fine-tuning dataset by…
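The idea of fine-tuning only a subset of layers can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say which layers count as text-generation layers, so the parameter names and the `*.cross_attn.*` pattern below are hypothetical, as is the helper `restrict_trainable`. The helper accepts any iterable of `(name, param)` pairs whose items carry a `requires_grad` attribute, which is the shape PyTorch's `nn.Module.named_parameters()` yields. Stand-in objects are used here so the example runs without a model.

```python
from fnmatch import fnmatch
from types import SimpleNamespace


def restrict_trainable(named_params, patterns):
    """Freeze every parameter whose name matches none of the glob patterns.

    named_params: iterable of (name, param) pairs, where param has a
    settable requires_grad attribute.
    Returns the names that remain trainable, in iteration order.
    """
    trainable = []
    for name, param in named_params:
        keep = any(fnmatch(name, pattern) for pattern in patterns)
        param.requires_grad = keep
        if keep:
            trainable.append(name)
    return trainable


# Stand-in parameters with hypothetical names; a real model exposes its own.
demo_params = {
    "blocks.0.attn.to_q.weight": SimpleNamespace(requires_grad=True),
    "blocks.0.cross_attn.to_k.weight": SimpleNamespace(requires_grad=True),
    "blocks.1.cross_attn.to_v.weight": SimpleNamespace(requires_grad=True),
    "blocks.1.mlp.fc1.weight": SimpleNamespace(requires_grad=True),
}

# Hypothetical choice: treat cross-attention weights as the targeted layers.
trainable_names = restrict_trainable(demo_params.items(), ["*.cross_attn.*"])
```

With a real PyTorch model, the same helper could be called on `model.named_parameters()`, and only the parameters left with `requires_grad=True` would be handed to the optimizer. Which layers to select is a modeling decision that this sketch does not settle.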

Code & Models

Repositories

sprintml/toxicbench (PyTorch, official)

Videos

Beautiful Images, Toxic Words: Understanding and Addressing Offensive Text in Generated Images · underline

Taxonomy

Topics: Hate Speech and Cyberbullying Detection

Methods: Diffusion