Improved Few-Shot Jailbreaking Can Circumvent Aligned Language Models and Their Defenses

Xiaosen Zheng; Tianyu Pang; Chao Du; Qian Liu; Jing Jiang; Min Lin

arXiv:2406.01288·cs.CL·October 31, 2024·2 cites

TL;DR

This paper shows that simple improvements to few-shot jailbreaking can effectively jailbreak aligned large language models, bypassing advanced defenses with high success rates and exposing vulnerabilities in current alignment methods.

Contribution

It introduces two simple yet effective techniques, injecting special system tokens and running a demo-level random search over a collected demo pool, that significantly improve few-shot jailbreak success against aligned LLMs.

Findings

01 Achieves >80% (mostly >95%) ASR on Llama-2-7B and Llama-3-8B without multiple restarts

02 Nearly 100% ASR against various defenses and models

03 Remains effective against strong defenses such as perplexity detection and SmoothLLM
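The findings above are stated in terms of ASR (attack success rate). As a minimal sketch of how such a figure is typically computed, assuming each evaluated request has already been labeled as a success or failure by some judge, the percentage can be derived as below. The function name `attack_success_rate` and the sample verdicts are hypothetical and are not taken from the paper or its repository.

```python
from typing import Sequence


def attack_success_rate(outcomes: Sequence[bool]) -> float:
    """Return the percentage of evaluated requests labeled as successful.

    `outcomes` holds one boolean per request (True = judged a success).
    How a judge assigns these labels is outside the scope of this sketch.
    """
    if not outcomes:
        raise ValueError("outcomes must be non-empty")
    return 100.0 * sum(outcomes) / len(outcomes)


# Hypothetical verdicts for 50 requests: 48 successes, 2 failures.
verdicts = [True] * 48 + [False] * 2
print(f"ASR = {attack_success_rate(verdicts):.1f}%")  # prints "ASR = 96.0%"
```

Under this definition, a claim such as ">80% ASR" means that more than 80 out of every 100 evaluated requests were labeled successful.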

Abstract

Recently, Anil et al. (2024) show that many-shot (up to hundreds of) demonstrations can jailbreak state-of-the-art LLMs by exploiting their long-context capability. Nevertheless, is it possible to use few-shot demonstrations to efficiently jailbreak LLMs within limited context sizes? While vanilla few-shot jailbreaking may be inefficient, we propose improved techniques such as injecting special system tokens like [/INST] and employing demo-level random search from a collected demo pool. These simple techniques result in surprisingly effective jailbreaking against aligned LLMs (even with advanced defenses). For example, our method achieves >80% (mostly >95%) ASRs on Llama-2-7B and Llama-3-8B without multiple restarts, even if the models are enhanced by strong defenses such as perplexity detection and/or SmoothLLM, which are challenging for suffix-based jailbreaking. In addition, we…

Code & Models

Repositories

sail-sg/i-fsj (PyTorch, official)

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Digital and Cyber Forensics · Natural Language Processing Techniques

Methods: Random Search