TL;DR
This paper analyzes how many-shot jailbreaking attacks weaken safety in language models through representation drift and proposes a simple one-shot safety demonstration at inference to counteract this, improving robustness.
Contribution
It introduces a novel understanding of MSJ as implicit malicious fine-tuning and presents a straightforward inference-time method to mitigate such attacks.
Findings
Adding a single safety demonstration at inference restores refusal behavior.
The method does not require model fine-tuning or white-box access.
Empirically improves robustness against many-shot jailbreaking attacks.
Abstract
Many-shot jailbreaking (MSJ) causes safety-aligned language models to answer harmful queries by preceding them with many harmful question-answer demonstrations. We study why this attack becomes stronger as the number of demonstrations increases. Empirically, we find that MSJ induces a progressive activation drift: the representation of a fixed harmful query moves step by step away from the safety-aligned region as more harmful demonstrations are added. Theoretically, we show that this drift can be interpreted as implicit malicious fine-tuning: conditioning on N harmful demonstrations induces SGD-style updates equivalent to optimizing on the corresponding N harmful samples. This view turns the attack mechanism into a defense principle. We append a fixed one-shot safety demonstration at inference time, which induces a counteracting safety-oriented update and restores refusal behavior. The…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
