The Geometry of Refusal in Large Language Models: Concept Cones and Representational Independence

Tom Wollschläger; Jannes Elstner; Simon Geisler; Vincent Cohen-Addad; Stephan Günnemann; Johannes Gasteiger

arXiv:2502.17420·cs.LG·February 10, 2026

TL;DR

This paper examines the geometry of refusal in large language models. It finds that refusal is mediated by multiple independent directions and multi-dimensional concept cones rather than a single direction, and it introduces representational independence as a framework for identifying directions that remain independent under intervention.

Contribution

The paper introduces a gradient-based approach to representation engineering and uses it to identify multiple independent refusal directions and multi-dimensional concept cones.

Findings

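01 Multiple independent refusal directions identified

02 Refusal is mediated by multi-dimensional concept cones, not only single directions

03 Orthogonality alone does not imply independence under intervention; mechanistically independent refusal directions are identified using representational independence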

Abstract

The safety alignment of large language models (LLMs) can be circumvented through adversarially crafted inputs, yet the mechanisms by which these attacks bypass safety barriers remain poorly understood. Prior work suggests that a single refusal direction in the model's activation space determines whether an LLM refuses a request. In this study, we propose a novel gradient-based approach to representation engineering and use it to identify refusal directions. Contrary to prior work, we uncover multiple independent directions and even multi-dimensional concept cones that mediate refusal. Moreover, we show that orthogonality alone does not imply independence under intervention, motivating the notion of representational independence that accounts for both linear and non-linear effects. Using this framework, we identify mechanistically independent refusal directions. We show that refusal…
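
The claim that orthogonality alone does not imply independence under intervention can be illustrated with a toy example. The sketch below is not the paper's method: the random activations, the directions `r1` and `r2`, and the single ReLU layer `toy_layer` are all hypothetical stand-ins. It removes the component along `r1` from a batch of activations and checks how the projection onto an orthogonal direction `r2` responds, both at the same layer and after one non-linear layer.

```python
import numpy as np

def ablate_direction(acts, direction):
    """Remove the component of each activation row along `direction`."""
    r = direction / np.linalg.norm(direction)
    return acts - np.outer(acts @ r, r)

rng = np.random.default_rng(0)
d = 8                                  # toy hidden size (assumption)
acts = rng.normal(size=(4, d))         # hypothetical batch of activations

r1 = rng.normal(size=d)                # hypothetical "refusal direction"
r2 = rng.normal(size=d)
r2 -= (r2 @ r1) / (r1 @ r1) * r1       # make r2 orthogonal to r1

ablated = ablate_direction(acts, r1)

# At the same layer, ablating r1 leaves the projection onto r2 untouched,
# because the two directions are orthogonal.
same_layer_shift = np.abs(ablated @ r2 - acts @ r2).max()

# After a toy non-linear layer (random weights + ReLU, an assumption),
# the same intervention generally does change the projection onto r2.
W = rng.normal(size=(d, d))

def toy_layer(x):
    return np.maximum(x @ W, 0.0)

next_layer_shift = np.abs(toy_layer(ablated) @ r2 - toy_layer(acts) @ r2).max()

print("shift along r2 at the same layer:", same_layer_shift)
print("shift along r2 after one layer:  ", next_layer_shift)
```

In this toy setting the same-layer shift is zero up to floating-point error, while the downstream shift generally is not. That gap is the kind of effect that motivates a notion of independence accounting for both linear and non-linear effects.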

Datasets

bedderautomation/refusal-geometry-qwen25-3b
dataset · 39 downloads

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling