Jailbreaking LLMs via Calibration
Yuxuan Lu, Yongkang Guo, Yuqing Kong

TL;DR
This paper models the effect of safety alignment in LLMs as a systematic distortion of a pre-alignment distribution, casts jailbreaking as a forecast aggregation problem, and derives optimal aggregation strategies that yield more effective jailbreaks.
Contribution
The paper presents a unified forecast-aggregation framework for jailbreaking LLMs that contains logit-arithmetic methods as a special case under cross-entropy loss, derives a broader family of aggregation rules from other proper losses, and introduces a new hybrid aggregation rule.
Findings
Achieves higher attack success rates on frontier models.
Reduces 'Jailbreak Tax' compared to existing methods.
Effective across multiple benchmarks and tasks.
Abstract
Safety alignment in Large Language Models (LLMs) often creates a systematic discrepancy between a model's aligned output and the underlying pre-aligned data distribution. We propose a framework in which the effect of safety alignment on next-token prediction is modeled as a systematic distortion of a pre-alignment distribution. We cast Weak-to-Strong Jailbreaking as a forecast aggregation problem and derive an optimal aggregation strategy characterized by a Gradient Shift in the loss-induced dual space. We show that logit-arithmetic jailbreaking methods are a special case of this framework under cross-entropy loss, and derive a broader family of aggregation rules corresponding to other proper losses. We also propose a new hybrid aggregation rule. Evaluations across red-teaming benchmarks and math utility tasks using frontier models demonstrate that our approach achieves superior Attack…
Taxonomy
Topics: Topic Modeling · Adversarial Robustness in Machine Learning · Artificial Intelligence in Healthcare and Education
