Harnessing LLM to Attack LLM-Guarded Text-to-Image Models
Yimo Deng, Huangxun Chen

TL;DR
This paper introduces DACA, a multi-agent, LLM-driven method that bypasses the safety filters of text-to-image models by rephrasing a drawing intent into multiple benign descriptions of individual visual components, reaching success rates of up to 76.7% on DALL-E 3 in one-time attacks.
Contribution
The paper presents a novel LLM-driven multi-agent approach for generating adversarial prompts that circumvent safety filters in T2I models, outperforming token replacement methods.
Findings
Achieves up to 76.7% (DALL-E 3) and 64% (Midjourney) success in one-time attacks
Achieves up to 98% (DALL-E 3) and 84% (Midjourney) success in re-use attacks
Open-sourced code and dataset for reproducibility
Abstract
To prevent Text-to-Image (T2I) models from generating unethical images, safety filters are deployed to block inappropriate drawing prompts. Previous works have employed token replacement to search for adversarial prompts that bypass these filters, but such prompts have become ineffective because nonsensical tokens fail semantic logic checks. In this paper, we approach adversarial prompts from a different perspective. We demonstrate that rephrasing a drawing intent into multiple benign descriptions of individual visual components can yield an effective adversarial prompt. We propose an LLM-piloted multi-agent method named DACA to automatically complete the intended rephrasing. Our method successfully bypasses the safety filters of DALL-E 3 and Midjourney to generate the intended images, achieving success rates of up to 76.7% and 64% in the one-time attack, and 98% and 84% in the re-use attack,…
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Generative Adversarial Networks and Image Synthesis · Digital Media Forensic Detection
