Jailbroken Frontier Models Retain Their Capabilities

Daniel Zhu; Zihan Wang; Xuchan Bao; Jerry Wei

arXiv:2605.00267 · cs.LG · May 6, 2026

TL;DR

This paper demonstrates that advanced jailbreaks against frontier language models cause minimal performance degradation, challenging the assumption that jailbreaks weaken models significantly.

Contribution

It shows that the jailbreak tax, the loss in task performance that a jailbreak imposes on the target model, scales inversely with model capability, and that the strongest jailbreaks do not substantially impair model performance.

Findings

01. Jailbreak tax decreases as model capability increases.

02. Reasoning tasks are more affected by jailbreaks than knowledge tasks.

03. Boundary Point Jailbreaking nearly perfectly evades classifiers with minimal model impact.

Abstract

As language model safeguards become more robust, attackers are pushed toward developing increasingly complex jailbreaks. Prior work has found that this complexity imposes a "jailbreak tax" that degrades the target model's task performance. We show that this tax scales inversely with model capability and that the most advanced jailbreaks effectively yield no reduction in model capabilities. Evaluating 28 jailbreaks on five benchmarks across Claude models ranging in capability from Haiku 4.5 to Opus 4.6, we find Haiku 4.5 loses an average of 33.1% on benchmark performance when jailbroken, while Opus 4.6 at max thinking effort loses only 7.7%. We also observe that across all models, reasoning-heavy tasks display considerably more degradation than knowledge-recall tasks. Finally, Boundary Point Jailbreaking, currently the strongest jailbreak against deployed classifiers, achieves…
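
The averaged percentages above can be illustrated with a minimal sketch. It assumes the jailbreak tax is the relative drop in benchmark accuracy, averaged over every (jailbreak, benchmark) pair; the abstract does not state whether the reported losses are relative or absolute, so this definition is an assumption. The function names `jailbreak_tax` and `average_tax` and the input layout are hypothetical, not taken from the paper.

```python
from statistics import mean


def jailbreak_tax(baseline_acc: float, jailbroken_acc: float) -> float:
    """Relative accuracy drop caused by a jailbreak.

    ASSUMPTION: the tax is (baseline - jailbroken) / baseline.
    The paper may define it differently (e.g. as an absolute drop).
    """
    if baseline_acc <= 0:
        raise ValueError("baseline accuracy must be positive")
    return (baseline_acc - jailbroken_acc) / baseline_acc


def average_tax(
    baseline: dict[str, float],
    jailbroken: dict[tuple[str, str], float],
) -> float:
    """Average the tax over all (jailbreak, benchmark) pairs.

    baseline:   benchmark name -> accuracy without a jailbreak
    jailbroken: (jailbreak name, benchmark name) -> accuracy with it
    Both layouts are hypothetical, chosen only for this sketch.
    """
    return mean(
        jailbreak_tax(baseline[benchmark], acc)
        for (_jailbreak, benchmark), acc in jailbroken.items()
    )
```

Calling `average_tax` once per model, with that model's baseline and jailbroken accuracies, would give one number per model that could be compared across capability levels. Any accuracies passed in here are the caller's own measurements; the sketch contains no results from the paper.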
