Jailbreak Attack Initializations as Extractors of Compliance Directions
Amit Levi, Rom Himelstein, Yaniv Nemcovsky, Avi Mendelson, Chaim Baskin

TL;DR
This paper reveals that jailbreak attacks on safety-aligned LLMs converge to a single compliance direction, and introduces CRI, a framework that improves attack success rates by projecting prompts along these directions.
Contribution
The work uncovers the convergence of gradient-based jailbreak attacks to a compliance direction and proposes CRI, a novel initialization method to enhance attack effectiveness.
Findings
CRI increases attack success rate across models and datasets.
Attacks converge to a single compliance direction.
CRI reduces computational overhead of jailbreak attacks.
Abstract
Safety-aligned LLMs respond to prompts with either compliance or refusal, each corresponding to a distinct direction in the model's activation space. Recent works show that initializing attacks via self-transfer from other prompts significantly enhances their performance. However, the mechanisms underlying these initializations remain unclear, and attacks rely on arbitrary or hand-picked initializations. This work shows that each gradient-based jailbreak attack, and each subsequent initialization, gradually converges to a single compliance direction that suppresses refusal, thereby enabling an efficient transition from refusal to compliance. Based on this insight, we propose CRI, an initialization framework that aims to project unseen prompts further along compliance directions. We demonstrate our approach on multiple attacks, models, and datasets, achieving an increased attack success…
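To make the notion of a "direction in activation space" concrete, here is a minimal sketch of one common way such a direction can be estimated: the normalized difference between the mean vectors of two groups, followed by a scalar projection onto it. This is an illustrative assumption, not the authors' method; the abstract does not say how compliance directions are computed. The helper names (`mean_vector`, `unit_direction`, `projection`) are hypothetical, and the vectors are made-up toy numbers rather than model activations.

```python
import math

def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def unit_direction(group_a, group_b):
    """Unit vector pointing from the mean of group_b toward the mean of group_a."""
    diff = [a - b for a, b in zip(mean_vector(group_a), mean_vector(group_b))]
    norm = math.sqrt(sum(x * x for x in diff))
    if norm == 0.0:
        raise ValueError("group means coincide; direction is undefined")
    return [x / norm for x in diff]

def projection(vector, direction):
    """Scalar projection of a vector onto a unit direction."""
    return sum(v * d for v, d in zip(vector, direction))

# Toy 3-dimensional vectors (assumed values, not real model activations).
compliance_acts = [[2.0, 0.0, 1.0], [4.0, 0.0, 1.0]]
refusal_acts = [[-2.0, 0.0, 1.0], [-4.0, 0.0, 1.0]]

# Direction separating the two groups, and each vector's position along it.
direction = unit_direction(compliance_acts, refusal_acts)
scores = [projection(v, direction) for v in compliance_acts + refusal_acts]
```

In this toy setup the two groups differ only along the first coordinate, so the estimated direction lies along that axis and the projection scores separate the groups by sign.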
Taxonomy
Topics: Security and Verification in Computing · Advanced Malware Detection Techniques · Adversarial Robustness in Machine Learning
