GenBreak: Red Teaming Text-to-Image Generators Using Large Language Models
Zilong Wang, Xiang Zheng, Xiaosen Wang, Bo Wang, Xingjun Ma, Yu-Gang Jiang

TL;DR
GenBreak introduces a novel framework that fine-tunes large language models to systematically generate adversarial prompts, exposing safety vulnerabilities in text-to-image generators and highlighting the need for improved content moderation.
Contribution
The paper presents a new method combining supervised fine-tuning and reinforcement learning to craft effective adversarial prompts for T2I models, addressing limitations of prior red-teaming approaches.
Findings
Demonstrates effective black-box attacks on commercial T2I models
Reveals significant safety vulnerabilities in current systems
Generates adversarial prompts with high toxicity and diversity
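This summary does not say how prompt diversity is measured. As a purely illustrative assumption (not the paper's metric), one common proxy is the mean pairwise Jaccard distance between the token sets of a batch of prompts: 0 means all prompts share the same vocabulary, 1 means no two prompts share a token. The function names below are hypothetical.

```python
from itertools import combinations


def jaccard_distance(a: str, b: str) -> float:
    """Return 1 - |A ∩ B| / |A ∪ B| over lowercase whitespace tokens."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    union = tokens_a | tokens_b
    if not union:
        # Two empty prompts are treated as identical.
        return 0.0
    return 1.0 - len(tokens_a & tokens_b) / len(union)


def mean_pairwise_diversity(prompts: list[str]) -> float:
    """Average Jaccard distance over all unordered prompt pairs.

    Assumption: this is an illustrative proxy, not the metric used in the paper.
    """
    pairs = list(combinations(prompts, 2))
    if not pairs:
        # Fewer than two prompts: diversity is undefined, report 0.0.
        return 0.0
    return sum(jaccard_distance(a, b) for a, b in pairs) / len(pairs)


if __name__ == "__main__":
    batch = ["a red fox in snow", "a red fox at night", "city skyline at dusk"]
    print(f"diversity = {mean_pairwise_diversity(batch):.3f}")
```

A token-overlap measure like this is cheap but ignores paraphrase; an embedding-based distance would be a natural alternative if semantic diversity matters.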
Abstract
Text-to-image (T2I) models such as Stable Diffusion have advanced rapidly and are now widely used in content creation. However, these models can be misused to generate harmful content, including nudity or violence, posing significant safety risks. While most platforms employ content moderation systems, underlying vulnerabilities can still be exploited by determined adversaries. Recent research on red-teaming and adversarial attacks against T2I models has notable limitations: some studies successfully generate highly toxic images but use adversarial prompts that are easily detected and blocked by safety filters, while others focus on bypassing safety mechanisms but fail to produce genuinely harmful outputs, neglecting the discovery of truly high-risk prompts. Consequently, there remains a lack of reliable tools for evaluating the safety of defended T2I models. To address this gap, we…
Taxonomy
Topics: Hate Speech and Cyberbullying Detection · Adversarial Robustness in Machine Learning · Generative Adversarial Networks and Image Synthesis
Methods: Focus · Diffusion
