Making Them Ask and Answer: Jailbreaking Large Language Models in Few   Queries via Disguise and Reconstruction

Tong Liu; Yingjie Zhang; Zhe Zhao; Yinpeng Dong; Guozhu Meng; Kai Chen

arXiv:2402.18104·cs.CR·June 11, 2024·5 cites

Making Them Ask and Answer: Jailbreaking Large Language Models in Few Queries via Disguise and Reconstruction

Tong Liu, Yingjie Zhang, Zhe Zhao, Yinpeng Dong, Guozhu Meng, Kai Chen

PDF

Open Access 1 Repo

TL;DR

This paper introduces DRA, a black-box attack method that exploits bias vulnerabilities in large language models to generate harmful responses, achieving high success rates across multiple models including GPT-4.

Contribution

It presents a novel theoretical framework and a practical attack method for jailbreaking LLMs by disguising and reconstructing harmful instructions.

Findings

01

DRA achieves a 91.1% success rate on GPT-4.

02

Effective across open-source and closed-source models.

03

State-of-the-art attack efficiency demonstrated.

Abstract

In recent years, large language models (LLMs) have demonstrated notable success across various tasks, but the trustworthiness of LLMs is still an open problem. One specific threat is the potential to generate toxic or harmful responses. Attackers can craft adversarial prompts that induce harmful responses from LLMs. In this work, we pioneer a theoretical foundation in LLMs security by identifying bias vulnerabilities within the safety fine-tuning and design a black-box jailbreak method named DRA (Disguise and Reconstruction Attack), which conceals harmful instructions through disguise and prompts the model to reconstruct the original harmful instruction within its completion. We evaluate DRA across various open-source and closed-source models, showcasing state-of-the-art jailbreak success rates and attack efficiency. Notably, DRA boasts a 91.1% attack success rate on OpenAI GPT-4…

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Code & Models

Repositories

llm-dra/dra
pytorchOfficial

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsNatural Language Processing Techniques · Topic Modeling · Data Quality and Management

MethodsAttention Is All You Need · Linear Layer · Layer Normalization · Byte Pair Encoding · Dropout · Multi-Head Attention · Softmax · Dense Connections · Label Smoothing · Adam