PrisonBreak: Jailbreaking Large Language Models with at Most Twenty-Five Targeted Bit-flips

Zachary Coalson; Jeonghyun Woo; Chris S. Lin; Joyce Qu; Yu Sun; Shiyang Chen; Lishan Yang; Gururaj Saileshwar; Prashant Nair; Bo Fang; and Sanghyun Hong

arXiv:2412.07192 · cs.CR · October 6, 2025

TL;DR

This paper introduces a novel, efficient method to jailbreak large language models by flipping only 5 to 25 bits in their parameters, bypassing safety measures and enabling harmful outputs without input modifications.

Contribution

The authors present a new, fast bit-flip attack technique that significantly reduces the number of required bit-flips to jailbreak large language models compared to prior methods.

Findings

01. Achieved high attack success rates of 80-98% on 10 open-source LLMs.

02. Successfully exploited models using Rowhammer-based fault injection with 69-91% success.

03. Identified model components and conditions that influence vulnerability to bit-flip attacks.

Abstract

We study a new vulnerability in commercial-scale safety-aligned large language models (LLMs): their refusal to generate harmful responses can be broken by flipping only a few bits in model parameters. Our attack jailbreaks billion-parameter language models with just 5 to 25 bit-flips, requiring up to 40$\times$ fewer bit-flips than prior attacks on much smaller computer vision models. Unlike prompt-based jailbreaks, our method directly uncensors models in memory at runtime, enabling harmful outputs without requiring input-level modifications. Our key innovation is an efficient bit-selection algorithm that identifies critical bits for language model jailbreaks up to 20$\times$ faster than prior methods. We evaluate our attack on 10 open-source LLMs, achieving high attack success rates (ASRs) of 80-98% with minimal impact on model utility. We further demonstrate an end-to-end exploit…
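To give a sense of why a handful of bit-flips can matter, the following sketch shows how much a single flipped bit can change one stored parameter value. It assumes the parameter is held in IEEE 754 half precision (float16), which the abstract does not state, and the helper name `flip_bit_fp16` is hypothetical. It illustrates floating-point encoding only; it is not the paper's bit-selection algorithm and does not touch any model.

```python
import struct


def flip_bit_fp16(value: float, bit: int) -> float:
    """Round `value` to IEEE 754 half precision and flip one bit of its encoding.

    Bit 0 is the least significant mantissa bit, bits 10-14 are the exponent,
    and bit 15 is the sign bit. (Assumption: float16 storage, for illustration.)
    """
    if not 0 <= bit <= 15:
        raise ValueError("bit index must be in 0..15")
    # Reinterpret the half-precision float as a 16-bit unsigned integer.
    (raw,) = struct.unpack("<H", struct.pack("<e", value))
    raw ^= 1 << bit  # flip exactly one bit
    # Reinterpret the modified bits as a half-precision float again.
    (flipped,) = struct.unpack("<e", struct.pack("<H", raw))
    return flipped


if __name__ == "__main__":
    w = 0.5
    for bit in (0, 14, 15):
        print(f"bit {bit:2d}: {w} -> {flip_bit_fp16(w, bit)}")
```

In this encoding, flipping a low mantissa bit barely moves the value, while flipping the most significant exponent bit changes its magnitude by many orders, which is one plausible reason a small number of well-chosen flips can have an outsized effect.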

Taxonomy

Topics: Digital and Cyber Forensics · Hate Speech and Cyberbullying Detection · Adversarial Robustness in Machine Learning