Mitigating Many-shot Jailbreak Attacks with One Single Demonstration

Kejia Chen; Jiawen Zhang; Boheng Li; Pengcheng Li; Jian Lou; Zunlei Feng; Mingli Song; Ruoxi Jia; Tianwei Zhang

arXiv:2605.08277·cs.CR·May 12, 2026

Mitigating Many-shot Jailbreak Attacks with One Single Demonstration

Kejia Chen, Jiawen Zhang, Boheng Li, Pengcheng Li, Jian Lou, Zunlei Feng, Mingli Song, Ruoxi Jia, Tianwei Zhang

PDF

1 Repo

TL;DR

This paper analyzes how many-shot jailbreaking attacks weaken safety in language models through representation drift and proposes a simple one-shot safety demonstration at inference to counteract this, improving robustness.

Contribution

It introduces a novel understanding of MSJ as implicit malicious fine-tuning and presents a straightforward inference-time method to mitigate such attacks.

Findings

01

Adding a single safety demonstration at inference restores refusal behavior.

02

The method does not require model fine-tuning or white-box access.

03

Empirically improves robustness against many-shot jailbreaking attacks.

Abstract

Many-shot jailbreaking (MSJ) causes safety-aligned language models to answer harmful queries by preceding them with many harmful question-answer demonstrations. We study why this attack becomes stronger as the number of demonstrations increases. Empirically, we find that MSJ induces a progressive activation drift: the representation of a fixed harmful query moves step by step away from the safety-aligned region as more harmful demonstrations are added. Theoretically, we show that this drift can be interpreted as implicit malicious fine-tuning: conditioning on N harmful demonstrations induces SGD-style updates equivalent to optimizing on the corresponding N harmful samples. This view turns the attack mechanism into a defense principle. We append a fixed one-shot safety demonstration at inference time, which induces a counteracting safety-oriented update and restores refusal behavior. The…

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Code & Models

Repositories

Thecommonirin/SafeEnd
github

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.