Jailbreaking Prompt Attack: A Controllable Adversarial Attack against Diffusion Models

Jiachen Ma; Yijiang Li; Zhiqing Xiao; Anda Cao; Jie Zhang; Chao Ye; Junbo Zhao

arXiv:2404.02928 · cs.CR · May 27, 2025 · 3 cites

TL;DR

This paper introduces the Jailbreaking Prompt Attack (JPA), a fast, universal attack that induces text-to-image diffusion models to generate harmful images by exploiting NSFW concepts in the text embedding space, without requiring access to the target model.

Contribution

JPA is an efficient attack that bypasses safety filters in diffusion models by optimizing a prefix prompt toward malicious concepts located in the text embedding space, without requiring access to the target model or a lengthy optimization process.

Findings

01. JPA successfully bypasses safety checkers in multiple T2I models.

02. JPA maintains high semantic alignment with target prompts.

03. JPA operates faster and more automatically than previous methods.

Abstract

Text-to-image (T2I) models can be maliciously used to generate harmful content such as sexually explicit, unfaithful, and misleading or Not-Safe-for-Work (NSFW) images. Previous attacks largely depend on the availability of the diffusion model or involve a lengthy optimization process. In this work, we investigate a more practical and universal attack that does not require the presence of a target model and demonstrate that the high-dimensional text embedding space inherently contains NSFW concepts that can be exploited to generate harmful images. We present the Jailbreaking Prompt Attack (JPA). JPA first searches for the target malicious concepts in the text embedding space using a group of antonyms generated by ChatGPT. Subsequently, a prefix prompt is optimized in the discrete vocabulary space to align malicious concepts semantically in the text embedding space. We further introduce…

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Smart Grid Security and Resilience · Advanced Malware Detection Techniques

Methods: Diffusion · ALIGN · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · Contrastive Language-Image Pre-training