Harnessing LLM to Attack LLM-Guarded Text-to-Image Models

Yimo Deng; Huangxun Chen

arXiv:2312.07130 · cs.AI · November 27, 2024 · 1 citation

TL;DR

This paper introduces DACA, an LLM-piloted multi-agent method that bypasses the safety filters of text-to-image models by rephrasing a drawing intent into multiple benign descriptions of individual visual components. It reports one-time attack success rates of up to 76.7% on DALL-E 3 and 64% on Midjourney.

Contribution

The paper presents a novel LLM-driven multi-agent approach for generating adversarial prompts that circumvent safety filters in T2I models, outperforming token replacement methods.

Findings

01. Achieves up to 76.7% success in one-time attacks on DALL-E 3

02. Achieves up to 84% success in re-use attacks on Midjourney

03. Open-sourced code and dataset for reproducibility

Abstract

To prevent Text-to-Image (T2I) models from generating unethical images, people deploy safety filters to block inappropriate drawing prompts. Previous works have employed token replacement to search for adversarial prompts that attempt to bypass these filters, but they have become ineffective as nonsensical tokens fail semantic logic checks. In this paper, we approach adversarial prompts from a different perspective. We demonstrate that rephrasing a drawing intent into multiple benign descriptions of individual visual components can obtain an effective adversarial prompt. We propose an LLM-piloted multi-agent method named DACA to automatically complete the intended rephrasing. Our method successfully bypasses the safety filters of DALL-E 3 and Midjourney to generate the intended images, achieving success rates of up to 76.7% and 64% in the one-time attack, and 98% and 84% in the re-use attack,…

Code & Models

Repositories

researchcode001/divide-and-conquer-attack (Official)

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Generative Adversarial Networks and Image Synthesis · Digital Media Forensic Detection