SurrogatePrompt: Bypassing the Safety Filter of Text-to-Image Models via Substitution

Zhongjie Ba; Jieming Zhong; Jiachen Lei; Peng Cheng; Qinglong Wang; Zhan Qin; Zhibo Wang; Kui Ren

arXiv:2309.14122 · cs.CV · October 18, 2024 · 1 citation

Open Access

TL;DR

This paper introduces SurrogatePrompt, a framework that systematically creates prompts to bypass safety filters in text-to-image models, revealing significant vulnerabilities and safety risks in current AI image generation systems.

Contribution

The paper presents the first prompt attacks on Midjourney and a framework that evades safety filters by combining large language models with image-to-text and image-to-image modules, reporting an 88% success rate.

Findings

01. 88% success rate in bypassing safety filters

02. Generated images depict political figures in violent scenarios

03. Revealed fundamental principles of prompt attacks

Abstract

Advanced text-to-image models such as DALL·E 2 and Midjourney possess the capacity to generate highly realistic images, raising significant concerns regarding the potential proliferation of unsafe content. This includes adult, violent, or deceptive imagery of political figures. Despite claims of rigorous safety mechanisms implemented in these models to restrict the generation of not-safe-for-work (NSFW) content, we successfully devise and exhibit the first prompt attacks on Midjourney, resulting in the production of abundant photorealistic NSFW images. We reveal the fundamental principles of such prompt attacks and suggest strategically substituting high-risk sections within a suspect prompt to evade closed-source safety measures. Our novel framework, SurrogatePrompt, systematically generates attack prompts, utilizing large language models, image-to-text, and image-to-image…

Taxonomy

Topics: Adversarial Robustness in Machine Learning