Why Do Aligned LLMs Remain Jailbreakable: Refusal-Escape Directions, Operator-Level Sources, and Safety-Utility Trade-off

Yu Chen; Yuanhao Liu; Qi Cao

arXiv:2605.08878 · cs.CR · May 12, 2026

TL;DR

This paper investigates why aligned large language models remain vulnerable to jailbreaks. It identifies Refusal-Escape Directions (RED): local perturbation directions around a harmful input that shift the model from refusing to answering while the model still interprets the input as harmful.

Contribution

It introduces the concept of RED, decomposes it into operator-level sources, and highlights the safety-utility trade-off in eliminating jailbreak vulnerabilities.

Findings

01 RED can be exposed by added token dimensions.

02 Successful jailbreaks are associated with shifts towards terminal-source contributions.

03 Eliminating RED requires balancing safety mechanisms and model utility.

Abstract

Aligned large language models (LLMs) remain vulnerable to jailbreak attacks. Recent mechanistic studies have identified latent features and representation shifts associated with jailbreak success, but they leave a more fundamental question open: why do aligned LLMs remain jailbreakable, and what structural vulnerabilities in the model make this possible? We study this question through a continuous input-transformation view. Our theoretical finding is that aligned models can still exhibit Refusal-Escape Directions (RED): local perturbation directions around a harmful input that shift the model's behavior from refusal to answering while preserving the model's harmful-semantics interpretation. From this perspective, a jailbreak is not only a successful discrete prompt construction, but can also be understood as a refusal-to-answer behavior transition induced by continuously perturbing a…
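The idea of a direction that changes refusal behavior while leaving the harmful-semantics reading intact can be illustrated with a toy linear model. The sketch below is not the paper's construction: it assumes two hypothetical linear probes over a hidden vector, `w_refuse` (higher score means "refuse") and `w_harm` (higher score means "read as harmful"), and the function name `refusal_escape_direction` is invented for this example. Under those assumptions, a direction that lowers the refusal score without moving the harm score is the negated refusal probe with its component along the harm probe projected out.

```python
import numpy as np

def refusal_escape_direction(w_refuse, w_harm):
    """Toy illustration (assumed linear probes, not the paper's method):
    return a unit direction that lowers the refusal score w_refuse @ x
    while leaving the harm score w_harm @ x unchanged."""
    w_harm_unit = w_harm / np.linalg.norm(w_harm)
    # Start from the steepest-descent direction of the refusal score.
    d = -w_refuse
    # Project out the component along the harm probe so the harm score stays fixed.
    d = d - np.dot(d, w_harm_unit) * w_harm_unit
    norm = np.linalg.norm(d)
    if norm < 1e-12:
        raise ValueError("probes are parallel; no such direction exists")
    return d / norm

# Hypothetical random probes and input, purely for demonstration.
rng = np.random.default_rng(0)
w_refuse = rng.normal(size=16)
w_harm = rng.normal(size=16)
x = rng.normal(size=16)

d = refusal_escape_direction(w_refuse, w_harm)
x_shifted = x + 0.5 * d  # small step along the toy direction

refusal_before, refusal_after = w_refuse @ x, w_refuse @ x_shifted
harm_before, harm_after = w_harm @ x, w_harm @ x_shifted
```

In this linear toy the harm score is preserved exactly and the refusal score strictly decreases whenever the two probes are not parallel. A real model is nonlinear, so any such statement would hold only locally around the input.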
