Knowing without Acting: The Disentangled Geometry of Safety Mechanisms in Large Language Models
Jinman Wu, Yi Xie, Shen Lin, Shiqian Zhao, Xiaofeng Chen

TL;DR
This paper explores the geometric structure of safety mechanisms in large language models, revealing a disentangled two-part system for recognizing harmful content and deciding to act, enabling new attack strategies and architectural insights.
Contribution
It introduces the Disentangled Safety Hypothesis, a geometric analysis of safety signals, and new methods for causally dissociating and manipulating safety mechanisms in language models.
Findings
Discovered a universal transition from entangled to independent safety signals across model layers.
Developed the Double-Difference Extraction and Adaptive Causal Steering techniques.
Achieved state-of-the-art attack success against safety mechanisms with the Refusal Erasure Attack.
Abstract
Safety alignment is often conceptualized as a monolithic process wherein harmfulness detection automatically triggers refusal. However, the persistence of jailbreak attacks suggests a fundamental mechanistic decoupling. We propose the Disentangled Safety Hypothesis (DSH), positing that safety computation operates on two distinct subspaces: a Recognition Axis ("Knowing") and an Execution Axis ("Acting"). Our geometric analysis reveals a universal "Reflex-to-Dissociation" evolution, where these signals transition from antagonistic entanglement in early layers to structural independence in deep layers. To validate this, we introduce Double-Difference Extraction and Adaptive Causal Steering. Using our curated AmbiguityBench, we…
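The abstract does not say how the relationship between the two axes is measured. As a rough illustration only, one common way to compare two directions in activation space is cosine similarity: a strongly negative value would correspond to "antagonistic entanglement" and a value near zero to "structural independence". The vectors, the tolerance `tol`, and the helper names below are hypothetical toy choices, not taken from the paper.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def describe(cos, tol=0.1):
    """Map a cosine value to a coarse label (the threshold is an assumption)."""
    if cos < -tol:
        return "antagonistic"
    if abs(cos) <= tol:
        return "independent"
    return "aligned"

# Hypothetical (recognition, execution) direction pairs for two layers.
# Real directions would be extracted from model activations.
layers = {
    "early": ([1.0, 0.0, 0.0], [-0.9, 0.1, 0.0]),  # nearly opposite
    "deep": ([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]),    # orthogonal
}

labels = {name: describe(cosine(r, e)) for name, (r, e) in layers.items()}
print(labels)
```

With these toy inputs the early-layer pair is labeled antagonistic and the deep-layer pair independent, mirroring the qualitative trend the abstract describes.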
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Topic Modeling · Advanced Malware Detection Techniques
