PrisonBreak: Jailbreaking Large Language Models with at Most Twenty-Five Targeted Bit-flips
Zachary Coalson, Jeonghyun Woo, Chris S. Lin, Joyce Qu, Yu Sun, Shiyang Chen, Lishan Yang, Gururaj Saileshwar, Prashant Nair, Bo Fang, and Sanghyun Hong

TL;DR
This paper introduces a novel, efficient method to jailbreak large language models by flipping only 5 to 25 bits in their parameters, bypassing safety measures and enabling harmful outputs without input modifications.
Contribution
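The authors present a bit-flip attack built on an efficient bit-selection algorithm that identifies critical bits up to 20× faster than prior methods and jailbreaks large language models with up to 40× fewer bit-flips than prior attacks needed on much smaller computer vision models.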
Findings
Achieved high attack success rates of 80-98% on 10 open-source LLMs.
Demonstrated an end-to-end exploit using Rowhammer-based fault injection, with attack success rates of 69-91%.
Identified model components and conditions that influence vulnerability to bit-flip attacks.
Abstract
We study a new vulnerability in commercial-scale safety-aligned large language models (LLMs): their refusal to generate harmful responses can be broken by flipping only a few bits in model parameters. Our attack jailbreaks billion-parameter language models with just 5 to 25 bit-flips, requiring up to 40× fewer bit flips than prior attacks on much smaller computer vision models. Unlike prompt-based jailbreaks, our method directly uncensors models in memory at runtime, enabling harmful outputs without requiring input-level modifications. Our key innovation is an efficient bit-selection algorithm that identifies critical bits for language model jailbreaks up to 20× faster than prior methods. We evaluate our attack on 10 open-source LLMs, achieving high attack success rates (ASRs) of 80-98% with minimal impact on model utility. We further demonstrate an end-to-end exploit…
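To make concrete why a handful of bit-flips can matter, the sketch below shows how flipping a single bit changes one stored weight. It is only an illustration, not the authors' bit-selection algorithm: it assumes the weight is stored as an IEEE 754 half-precision float (an assumption, since the abstract does not state the storage format), and the helper name `flip_bit_fp16` is hypothetical.

```python
import struct


def flip_bit_fp16(value: float, bit: int) -> float:
    """Flip one bit of `value` as stored in IEEE 754 half precision.

    Assumed layout (illustrative only): bits 0-9 are the mantissa,
    bits 10-14 the exponent, and bit 15 the sign.
    """
    if not 0 <= bit <= 15:
        raise ValueError("bit must be in the range 0..15")
    # Reinterpret the half-precision float as a 16-bit unsigned integer.
    (raw,) = struct.unpack("<H", struct.pack("<e", value))
    # Flip the requested bit.
    raw ^= 1 << bit
    # Reinterpret the modified integer as a half-precision float again.
    (flipped,) = struct.unpack("<e", struct.pack("<H", raw))
    return flipped


if __name__ == "__main__":
    weight = 0.5
    for bit in (0, 14, 15):
        print(f"bit {bit:2d}: {weight} -> {flip_bit_fp16(weight, bit)}")
```

Under this assumed layout, flipping the lowest mantissa bit barely moves the weight, while flipping the top exponent bit changes its magnitude by several orders, which gives some intuition for why which bits are flipped matters more than how many.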
Taxonomy
Topics: Digital and Cyber Forensics · Hate Speech and Cyberbullying Detection · Adversarial Robustness in Machine Learning
