Knowing without Acting: The Disentangled Geometry of Safety Mechanisms in Large Language Models

Jinman Wu; Yi Xie; Shen Lin; Shiqian Zhao; Xiaofeng Chen

arXiv:2603.05773 · cs.CR · March 16, 2026

Open Access

TL;DR

This paper explores the geometric structure of safety mechanisms in large language models, revealing a disentangled two-part system for recognizing harmful content and deciding to act, enabling new attack strategies and architectural insights.

Contribution

The paper introduces the Disentangled Safety Hypothesis, a geometric analysis of safety signals, and new methods for causal dissociation and manipulation of safety mechanisms in language models.

Findings

01. Discovered a universal transition from entangled safety signals in early layers to independent signals in deep layers.

02. Developed the Double-Difference Extraction and Adaptive Causal Steering techniques.

03. Achieved state-of-the-art attack success against safety mechanisms with the Refusal Erasure Attack.

Abstract

Safety alignment is often conceptualized as a monolithic process wherein harmfulness detection automatically triggers refusal. However, the persistence of jailbreak attacks suggests a fundamental mechanistic decoupling. We propose the Disentangled Safety Hypothesis (DSH), positing that safety computation operates on two distinct subspaces: a Recognition Axis ($v_{H}$, "Knowing") and an Execution Axis ($v_{R}$, "Acting"). Our geometric analysis reveals a universal "Reflex-to-Dissociation" evolution, where these signals transition from antagonistic entanglement in early layers to structural independence in deep layers. To validate this, we introduce Double-Difference Extraction and Adaptive Causal Steering. Using our curated AmbiguityBench, we…
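As a reading aid, the contrast between "antagonistic entanglement" and "structural independence" can be pictured with cosine similarity between the two axes: a strongly negative cosine means the directions oppose each other, while a cosine near zero means they are close to orthogonal. The sketch below is only an illustration under that assumption. The abstract does not state which metric the authors use, the vectors are made-up toy values rather than data from the paper, and the helper names `cosine` and `relation` and the threshold `tol` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def relation(u, v, tol=0.2):
    """Label the relation between two axes from their cosine.

    tol is an arbitrary illustrative threshold, not a value from the paper.
    """
    c = cosine(u, v)
    if c < -tol:
        return "antagonistic"
    if abs(c) <= tol:
        return "independent"
    return "aligned"


# Toy stand-ins for v_H (recognition) and v_R (execution); assumed values.
early_v_h, early_v_r = [1.0, 0.0], [-1.0, 0.1]  # nearly opposite directions
deep_v_h, deep_v_r = [1.0, 0.0], [0.0, 1.0]     # orthogonal directions

print(relation(early_v_h, early_v_r))
print(relation(deep_v_h, deep_v_r))
```

With real models, the same comparison would presumably be run per layer on directions extracted from hidden states; how those directions are extracted is the subject of the paper and is not reproduced here.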

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Topic Modeling · Advanced Malware Detection Techniques