DiffusionAttacker: Diffusion-Driven Prompt Manipulation for LLM Jailbreak
Hao Wang, Hao Li, Junda Zhu, Xinyuan Wang, Chengwei Pan, MinLie Huang, Lei Sha

TL;DR
This paper presents DiffusionAttacker, a diffusion-based method for generating LLM jailbreak prompts that achieves a higher attack success rate, fluency, and diversity than existing techniques.
Contribution
The paper introduces a diffusion-driven sequence-to-sequence (seq2seq) approach to prompt rewriting. Unlike autoregressive generators, which limit the modification of already generated tokens, it allows more flexible rewrites and improves attack performance.
Findings
Outperforms previous methods in attack success rate
Achieves higher fluency and diversity in generated prompts
Demonstrates effectiveness on the AdvBench and HarmBench datasets
Abstract
Large Language Models (LLMs) are susceptible to generating harmful content when prompted with carefully crafted inputs, a vulnerability known as LLM jailbreaking. As LLMs become more powerful, studying jailbreak methods is critical to enhancing security and aligning models with human values. Traditionally, jailbreak techniques have relied on suffix addition or prompt templates, but these methods suffer from limited attack diversity. This paper introduces DiffusionAttacker, an end-to-end generative approach for jailbreak rewriting inspired by diffusion models. Our method employs a sequence-to-sequence (seq2seq) text diffusion model as a generator, conditioning on the original prompt and guiding the denoising process with a novel attack loss. Unlike previous approaches that use autoregressive LLMs to generate jailbreak prompts, which limit the modification of already generated tokens and…
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Digital and Cyber Forensics
Methods: Tanh Activation · Sigmoid Activation · Long Short-Term Memory · Diffusion · Sequence to Sequence
