Jailbreaking LLMs via Calibration

Yuxuan Lu; Yongkang Guo; Yuqing Kong

arXiv:2602.00619 · cs.CL · February 3, 2026

Open Access

TL;DR

This paper models safety alignment as a systematic distortion of a pre-alignment next-token distribution, casts Weak-to-Strong Jailbreaking as a forecast aggregation problem, and derives aggregation strategies that the authors report achieve higher attack success rates than existing methods.

Contribution

The paper presents a unified forecast-aggregation framework for jailbreaking LLMs that recovers logit-arithmetic methods as the special case under cross-entropy loss, extends to a broader family of rules for other proper losses, and introduces a new hybrid aggregation rule.

Findings

01. Achieves higher attack success rates on frontier models.

02. Reduces 'Jailbreak Tax' compared to existing methods.

03. Effective across multiple benchmarks and tasks.

Abstract

Safety alignment in Large Language Models (LLMs) often creates a systematic discrepancy between a model's aligned output and the underlying pre-aligned data distribution. We propose a framework in which the effect of safety alignment on next-token prediction is modeled as a systematic distortion of a pre-alignment distribution. We cast Weak-to-Strong Jailbreaking as a forecast aggregation problem and derive an optimal aggregation strategy characterized by a Gradient Shift in the loss-induced dual space. We show that logit-arithmetic jailbreaking methods are a special case of this framework under cross-entropy loss, and derive a broader family of aggregation rules corresponding to other proper losses. We also propose a new hybrid aggregation rule. Evaluations across red-teaming benchmarks and math utility tasks using frontier models demonstrate that our approach achieves superior Attack…
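
The abstract's claim that logit arithmetic is the cross-entropy special case of a shift in a "loss-induced dual space" can be illustrated with a toy numerical sketch. This is not the authors' implementation, and the paper's exact aggregation rule is not given in the text above. The sketch assumes the standard fact that the log loss is generated by negative entropy, whose gradient is log p up to a constant, so an additive shift in that dual space is an additive shift of log-probabilities. For contrast it also shows an additive shift in probability space, which is what a quadratic (Brier) loss would suggest, using a crude clip-and-renormalize step in place of an exact projection. All function names, the `alpha` weight, and the four-outcome distributions are hypothetical.

```python
import numpy as np

def normalize(v):
    """Rescale a non-negative vector so it sums to 1."""
    v = np.asarray(v, dtype=float)
    return v / v.sum()

def log_space_shift(p_target, p_shifted, p_base, alpha=1.0):
    """Additive shift of log-probabilities (logit arithmetic), then renormalize.

    Assumption: under log loss the dual coordinate of p is log p (up to a
    constant), so the shift is log q = log p_target + alpha * (log p_shifted - log p_base).
    """
    p_target, p_shifted, p_base = (np.asarray(x, dtype=float)
                                   for x in (p_target, p_shifted, p_base))
    log_q = np.log(p_target) + alpha * (np.log(p_shifted) - np.log(p_base))
    log_q -= log_q.max()  # subtract the max for numerical stability
    return normalize(np.exp(log_q))

def prob_space_shift(p_target, p_shifted, p_base, alpha=1.0):
    """Additive shift of probabilities, as a quadratic loss would suggest.

    Assumption: clipping and renormalizing is a crude stand-in for an exact
    projection back onto the probability simplex.
    """
    p_target, p_shifted, p_base = (np.asarray(x, dtype=float)
                                   for x in (p_target, p_shifted, p_base))
    q = p_target + alpha * (p_shifted - p_base)
    return normalize(np.clip(q, 1e-12, None))

# Hypothetical distributions over four abstract outcomes.
p_target = [0.7, 0.1, 0.1, 0.1]   # forecast to be adjusted
p_base = [0.4, 0.2, 0.2, 0.2]     # reference forecast with the distortion
p_shifted = [0.2, 0.4, 0.2, 0.2]  # reference forecast without the distortion

q_log = log_space_shift(p_target, p_shifted, p_base)
q_lin = prob_space_shift(p_target, p_shifted, p_base)
print("log-space shift: ", q_log)
print("prob-space shift:", q_lin)
```

The two rules generally give different aggregates from the same three forecasts, which is the sense in which the choice of proper loss selects a member of the broader family the abstract mentions.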

Taxonomy

Topics: Topic Modeling · Adversarial Robustness in Machine Learning · Artificial Intelligence in Healthcare and Education