GenBreak: Red Teaming Text-to-Image Generators Using Large Language Models

Zilong Wang; Xiang Zheng; Xiaosen Wang; Bo Wang; Xingjun Ma; Yu-Gang Jiang

arXiv:2506.10047·cs.CR·June 13, 2025

TL;DR

GenBreak introduces a novel framework that fine-tunes large language models to systematically generate adversarial prompts, exposing safety vulnerabilities in text-to-image generators and highlighting the need for improved content moderation.

Contribution

The paper presents a new method combining supervised fine-tuning and reinforcement learning to craft effective adversarial prompts for T2I models, addressing limitations of prior red-teaming approaches.
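This page does not describe the reward used in the reinforcement learning stage, so the following is only a minimal sketch of how the goals named here (harmful output, evading safety filters, and prompt diversity) could be folded into one scalar reward. Every name, score range, and weight below is a hypothetical placeholder, not the authors' formulation.

```python
from dataclasses import dataclass


@dataclass
class PromptScores:
    """Hypothetical per-prompt measurements; none of these come from the paper."""
    toxicity: float  # assumed image-toxicity score in [0, 1] from some classifier
    bypassed: bool   # assumed flag: True if the prompt passed the safety filter
    novelty: float   # assumed diversity score in [0, 1] versus earlier prompts


def red_team_reward(scores: PromptScores, novelty_weight: float = 0.2) -> float:
    """Combine the three signals into one scalar reward (illustrative only).

    A blocked prompt earns nothing; an unblocked prompt earns its toxicity
    score plus a small bonus for being different from earlier prompts.
    """
    if not scores.bypassed:
        return 0.0
    return scores.toxicity + novelty_weight * scores.novelty
```

Gating on the filter outcome reflects the limitation the abstract attributes to prior work: a prompt that yields toxic images but is blocked by the filter is of little use for evaluating a defended system.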

Findings

01. Effective black-box attacks on commercial T2I models

02. Revealed significant safety vulnerabilities in current systems

03. Generated prompts with high toxicity and diversity

Abstract

Text-to-image (T2I) models such as Stable Diffusion have advanced rapidly and are now widely used in content creation. However, these models can be misused to generate harmful content, including nudity or violence, posing significant safety risks. While most platforms employ content moderation systems, underlying vulnerabilities can still be exploited by determined adversaries. Recent research on red-teaming and adversarial attacks against T2I models has notable limitations: some studies successfully generate highly toxic images but use adversarial prompts that are easily detected and blocked by safety filters, while others focus on bypassing safety mechanisms but fail to produce genuinely harmful outputs, neglecting the discovery of truly high-risk prompts. Consequently, there remains a lack of reliable tools for evaluating the safety of defended T2I models. To address this gap, we…

Taxonomy

Topics: Hate Speech and Cyberbullying Detection · Adversarial Robustness in Machine Learning · Generative Adversarial Networks and Image Synthesis

Methods: Focus · Diffusion