DrAttack: Prompt Decomposition and Reconstruction Makes Powerful LLM Jailbreakers

Xirui Li; Ruochen Wang; Minhao Cheng; Tianyi Zhou; Cho-Jui Hsieh

arXiv:2402.16914 · cs.CR · November 13, 2024 · 3 cites

TL;DR

DrAttack decomposes a malicious prompt into sub-prompts and reconstructs it implicitly, which obscures the malicious intent. This raises jailbreak success rates against LLMs while requiring fewer queries.

Contribution

This paper presents an automatic prompt decomposition and reconstruction framework that enhances jailbreak effectiveness and stealthiness compared to existing prompt-only methods.

Findings

01 Achieves a 78.0% success rate on GPT-4 with only 15 queries.

02 Outperforms prior state-of-the-art prompt-only attackers in success rate.

03 Reduces the number of queries needed for a successful jailbreak.

Abstract

The safety alignment of Large Language Models (LLMs) is vulnerable to both manual and automated jailbreak attacks, which adversarially trigger LLMs to output harmful content. However, current methods for jailbreaking LLMs, which nest entire harmful prompts, are not effective at concealing malicious intent and can be easily identified and rejected by well-aligned LLMs. This paper discovers that decomposing a malicious prompt into separated sub-prompts can effectively obscure its underlying malicious intent by presenting it in a fragmented, less detectable form, thereby addressing these limitations. We introduce an automatic prompt Decomposition and Reconstruction framework for jailbreak Attack (DrAttack). DrAttack includes three key components: (a) 'Decomposition' of the original prompt into sub-prompts, (b) 'Reconstruction' of these sub-prompts implicitly by…

Code & Models

Repositories

xirui-li/drattack (Official)

Taxonomy

Topics: Digital and Cyber Forensics · Law, AI, and Intellectual Property

Methods: Linear Layer · Dropout · Layer Normalization · Byte Pair Encoding · Multi-Head Attention · Dense Connections · Label Smoothing · Adam · Attention Is All You Need · Softmax