DiffusionAttacker: Diffusion-Driven Prompt Manipulation for LLM Jailbreak

Hao Wang; Hao Li; Junda Zhu; Xinyuan Wang; Chengwei Pan; MinLie Huang; Lei Sha

arXiv:2412.17522 · cs.CL · January 7, 2025

TL;DR

This paper presents DiffusionAttacker, a novel diffusion-based method for generating effective LLM jailbreak prompts that outperform existing techniques in success rate, fluency, and diversity.

Contribution

The paper introduces a diffusion-driven, seq2seq approach for prompt rewriting, offering more flexible modifications and improved attack performance over autoregressive methods.

Findings

01 Outperforms previous methods in attack success rate

02 Achieves higher fluency and diversity in generated prompts

03 Demonstrates effectiveness on the AdvBench and HarmBench datasets

Abstract

Large Language Models (LLMs) are susceptible to generating harmful content when prompted with carefully crafted inputs, a vulnerability known as LLM jailbreaking. As LLMs become more powerful, studying jailbreak methods is critical to enhancing security and aligning models with human values. Traditionally, jailbreak techniques have relied on suffix addition or prompt templates, but these methods suffer from limited attack diversity. This paper introduces DiffusionAttacker, an end-to-end generative approach for jailbreak rewriting inspired by diffusion models. Our method employs a sequence-to-sequence (seq2seq) text diffusion model as a generator, conditioning on the original prompt and guiding the denoising process with a novel attack loss. Unlike previous approaches that use autoregressive LLMs to generate jailbreak prompts, which limit the modification of already generated tokens and…

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Digital and Cyber Forensics

Methods: Tanh Activation · Sigmoid Activation · Long Short-Term Memory · Diffusion · Sequence to Sequence