DrAttack: Prompt Decomposition and Reconstruction Makes Powerful LLM Jailbreakers
Xirui Li, Ruochen Wang, Minhao Cheng, Tianyi Zhou, Cho-Jui Hsieh

TL;DR
DrAttack decomposes a malicious prompt into sub-prompts and reconstructs it implicitly, which obscures the malicious intent, raises jailbreak success rates against LLMs, and requires fewer queries.
Contribution
This paper presents an automatic prompt decomposition and reconstruction framework that enhances jailbreak effectiveness and stealthiness compared to existing prompt-only methods.
Findings
Achieves 78.0% success rate on GPT-4 with only 15 queries.
Outperforms prior state-of-the-art prompt-only attackers in success rate.
Reduces number of queries needed for successful jailbreaks.
Abstract
The safety alignment of Large Language Models (LLMs) is vulnerable to both manual and automated jailbreak attacks, which adversarially trigger LLMs to output harmful content. However, current methods for jailbreaking LLMs, which nest entire harmful prompts, are not effective at concealing malicious intent and can be easily identified and rejected by well-aligned LLMs. This paper discovers that decomposing a malicious prompt into separated sub-prompts can effectively obscure its underlying malicious intent by presenting it in a fragmented, less detectable form, thereby addressing these limitations. We introduce an automatic prompt Decomposition and Reconstruction framework for jailbreak Attack (DrAttack). DrAttack includes three key components: (a) 'Decomposition' of the original prompt into sub-prompts, (b) 'Reconstruction' of these sub-prompts implicitly by…
Taxonomy
Topics: Digital and Cyber Forensics · Law, AI, and Intellectual Property
Methods: Linear Layer · Dropout · Layer Normalization · Byte Pair Encoding · Multi-Head Attention · Dense Connections · Label Smoothing · Adam · Attention Is All You Need · Softmax
