The Geometry of Refusal in Large Language Models: Concept Cones and Representational Independence
Tom Wollschläger, Jannes Elstner, Simon Geisler, Vincent Cohen-Addad, Stephan Günnemann, Johannes Gasteiger

TL;DR
This paper studies the geometry of refusal in large language models. It finds that refusal is mediated by multiple independent directions and multi-dimensional concept cones rather than a single direction, and it proposes representational independence as a criterion for when directions are independent under intervention.
Contribution
It introduces a gradient-based approach to representation engineering that identifies multiple independent refusal directions and multi-dimensional concept cones, and it defines representational independence to account for both linear and non-linear effects.
Findings
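Multiple independent refusal directions identified, contrary to the single-direction view in prior work
Refusal is mediated by multi-dimensional concept cones
Orthogonality alone does not imply independence under intervention; mechanistically independent refusal directions are identified using representational independence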
Abstract
The safety alignment of large language models (LLMs) can be circumvented through adversarially crafted inputs, yet the mechanisms by which these attacks bypass safety barriers remain poorly understood. Prior work suggests that a single refusal direction in the model's activation space determines whether an LLM refuses a request. In this study, we propose a novel gradient-based approach to representation engineering and use it to identify refusal directions. Contrary to prior work, we uncover multiple independent directions and even multi-dimensional concept cones that mediate refusal. Moreover, we show that orthogonality alone does not imply independence under intervention, motivating the notion of representational independence that accounts for both linear and non-linear effects. Using this framework, we identify mechanistically independent refusal directions. We show that refusal…
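To make the idea of a direction in activation space concrete, here is a minimal sketch of directional ablation, that is, removing the component of an activation vector along a chosen direction. It is an illustration under stated assumptions, not the paper's gradient-based method: the vectors are toy 3-dimensional values, and the helper names `dot`, `normalize`, and `ablate` are hypothetical.

```python
import math

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    """Scale v to unit length."""
    norm = math.sqrt(dot(v, v))
    if norm == 0.0:
        raise ValueError("cannot normalize the zero vector")
    return [a / norm for a in v]

def ablate(activation, direction):
    """Remove the component of `activation` along `direction`."""
    r = normalize(direction)
    coeff = dot(activation, r)
    return [a - coeff * b for a, b in zip(activation, r)]

# Toy values (assumptions for illustration, not model activations).
x = [2.0, -1.0, 0.5]
r1 = [1.0, 0.0, 0.0]   # first candidate direction
r2 = [0.0, 1.0, 0.0]   # second candidate direction, orthogonal to r1

x_ablated = ablate(x, r1)

# After ablation the activation has no component along r1 ...
print(dot(x_ablated, r1))
# ... and its component along the orthogonal r2 is unchanged.
print(dot(x_ablated, r2), dot(x, r2))
```

In this purely linear picture, ablating `r1` leaves the component along the orthogonal `r2` untouched. The abstract's point is that this does not guarantee independence under intervention in an actual network, where later non-linear layers can still couple the two directions.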
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling
