Jailbroken Frontier Models Retain Their Capabilities
Daniel Zhu, Zihan Wang, Xuchan Bao, Jerry Wei

TL;DR
This paper demonstrates that advanced jailbreaks against frontier language models cause minimal performance degradation, challenging the assumption that jailbreaks weaken models significantly.
Contribution
It shows that the jailbreak tax (the task performance a model loses when jailbroken) scales inversely with model capability, and that the strongest jailbreaks do not substantially impair model performance.
Findings
Jailbreak tax decreases as model capability increases.
Reasoning tasks are more affected by jailbreaks than knowledge tasks.
Boundary Point Jailbreaking nearly perfectly evades classifiers with minimal model impact.
Abstract
As language model safeguards become more robust, attackers are pushed toward developing increasingly complex jailbreaks. Prior work has found that this complexity imposes a "jailbreak tax" that degrades the target model's task performance. We show that this tax scales inversely with model capability and that the most advanced jailbreaks effectively yield no reduction in model capabilities. Evaluating 28 jailbreaks on five benchmarks across Claude models ranging in capability from Haiku 4.5 to Opus 4.6, we find Haiku 4.5 loses an average of 33.1% on benchmark performance when jailbroken, while Opus 4.6 at max thinking effort loses only 7.7%. We also observe that across all models, reasoning-heavy tasks display considerably more degradation than knowledge-recall tasks. Finally, Boundary Point Jailbreaking, currently the strongest jailbreak against deployed classifiers, achieves…
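One way to make the "jailbreak tax" concrete is as the relative drop in benchmark score between a model's unjailbroken baseline and its jailbroken run. The sketch below assumes that definition; the abstract does not state the paper's exact formula or averaging scheme, and the function names and scores here are hypothetical placeholders, not results from the paper.

```python
def jailbreak_tax(baseline_score: float, jailbroken_score: float) -> float:
    """Fraction of baseline performance lost under a jailbreak.

    Assumption: tax is a relative drop, (baseline - jailbroken) / baseline.
    The paper may define or aggregate it differently.
    """
    if baseline_score <= 0:
        raise ValueError("baseline_score must be positive")
    return (baseline_score - jailbroken_score) / baseline_score


def mean_jailbreak_tax(score_pairs: list[tuple[float, float]]) -> float:
    """Unweighted mean tax over (baseline, jailbroken) score pairs."""
    if not score_pairs:
        raise ValueError("score_pairs must not be empty")
    taxes = [jailbreak_tax(base, jb) for base, jb in score_pairs]
    return sum(taxes) / len(taxes)


if __name__ == "__main__":
    # Hypothetical scores for illustration only.
    pairs = [(80.0, 60.0), (50.0, 50.0)]
    print(f"mean tax: {mean_jailbreak_tax(pairs):.3f}")
```

Under this reading, a tax of 0 means the jailbreak costs the model nothing on the benchmark, and a larger value means more capability is lost.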
