Why Do Aligned LLMs Remain Jailbreakable: Refusal-Escape Directions, Operator-Level Sources, and Safety-Utility Trade-off
Yu Chen, Yuanhao Liu, Qi Cao

TL;DR
This paper investigates why aligned large language models remain vulnerable to jailbreaks by analyzing structural vulnerabilities and the role of Refusal-Escape Directions (RED) in model responses.
Contribution
It introduces the concept of RED, decomposes it into operator-level sources, and highlights the safety-utility trade-off in eliminating jailbreak vulnerabilities.
Findings
RED can be exposed by added token dimensions.
Successful jailbreaks are associated with shifts towards terminal-source contributions.
Eliminating RED requires balancing safety mechanisms and model utility.
Abstract
Aligned large language models (LLMs) remain vulnerable to jailbreak attacks. Recent mechanistic studies have identified latent features and representation shifts associated with jailbreak success, but they leave a more fundamental question open: why do aligned LLMs remain jailbreakable, and what structural vulnerabilities in the model make this possible? We study this question through a continuous input-transformation view. Our theoretical finding is that aligned models can still exhibit Refusal-Escape Directions (RED): local perturbation directions around a harmful input that shift the model's behavior from refusal to answering while preserving the model's harmful-semantics interpretation. From this perspective, a jailbreak is not only a successful discrete prompt construction, but can also be understood as a refusal-to-answer behavior transition induced by continuously perturbing a…
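As a toy illustration of the RED idea described in the abstract (and not the paper's actual construction), the sketch below assumes a model whose refusal and harmful-semantics readouts are both linear in a small embedding vector. The weights `w_refuse` and `w_harm`, the input `x`, and the helper names are all hypothetical. Under that assumption, a direction that lowers the refusal score while leaving the harm score untouched can be obtained by projecting the negative refusal gradient onto the subspace orthogonal to `w_harm`.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def escape_direction(w_refuse, w_harm):
    """Toy refusal-escape direction for linear readouts (an assumption).

    Takes the steepest-descent direction of the refusal score, -w_refuse,
    and removes its component along w_harm so that the harm score stays
    constant along the returned direction.
    """
    scale = dot(w_refuse, w_harm) / dot(w_harm, w_harm)
    return [-(r - scale * h) for r, h in zip(w_refuse, w_harm)]


def perturb(x, d, step):
    """Move the input x along direction d by the given step size."""
    return [xi + step * di for xi, di in zip(x, d)]


# Hypothetical readout weights and input embedding.
w_refuse = [1.0, 2.0, 0.0]   # positive score means "refuse"
w_harm = [1.0, 0.0, 0.0]     # score the model assigns to harmful semantics
x = [3.0, 1.0, 0.5]

d = escape_direction(w_refuse, w_harm)
x_new = perturb(x, d, step=2.0)

print("refusal score:", dot(w_refuse, x), "->", dot(w_refuse, x_new))
print("harm score:   ", dot(w_harm, x), "->", dot(w_harm, x_new))
```

In this toy setting the refusal score changes sign along the perturbation while the harm score is unchanged, which mirrors the refusal-to-answer transition with preserved harmful semantics that the abstract describes. Real models are nonlinear, so any such direction would only be meaningful locally.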
