There Is More to Refusal in Large Language Models than a Single Direction
Faaiz Joad, Majd Hawasly, Sabri Boughorbel, Nadir Durrani, Husrev Taha Sencar

TL;DR
This paper reveals that refusal behaviors in large language models are governed by multiple distinct directions in activation space, each influencing the style of refusal rather than its occurrence, challenging the idea of a single controlling direction.
Contribution
The paper demonstrates that refusal in language models spans multiple geometrically distinct directions in activation space, refining the single-direction hypothesis advanced by prior work.
Findings
Refusal behaviors correspond to multiple distinct directions in activation space.
Linear steering along any refusal-related direction yields nearly identical trade-offs between refusal and over-refusal.
Different directions influence how the model refuses, not whether it refuses.
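One way to picture "geometrically distinct directions" is to estimate a direction per refusal category and compare the directions by cosine similarity. The sketch below assumes a difference-of-means estimator (mean activation on refused prompts minus mean activation on complied prompts), which is a common choice in work on refusal directions but is not confirmed here as this paper's method. The function names, the category names, and the toy 3-dimensional activations are all hypothetical.

```python
import math

def mean_vector(rows):
    """Component-wise mean of a list of equal-length activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def direction(refused, complied):
    """Unit-norm difference of means (assumed estimator, not the paper's confirmed one)."""
    diff = [a - b for a, b in zip(mean_vector(refused), mean_vector(complied))]
    norm = math.sqrt(sum(x * x for x in diff))
    return [x / norm for x in diff]

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy activations, made up for illustration only.
complied = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
safety_refused = [[2.0, 0.0, 0.0], [2.0, 0.0, 0.0]]
unsupported_refused = [[1.0, 1.0, 0.0], [1.0, 1.0, 0.0]]

d_safety = direction(safety_refused, complied)
d_unsupported = direction(unsupported_refused, complied)

# A similarity well below 1 means the two categories point in different directions.
similarity = cosine(d_safety, d_unsupported)
print(round(similarity, 4))
```

In this toy setup the two directions sit 45 degrees apart, so they are related but not identical; real model activations would have thousands of dimensions and one direction per category.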
Abstract
Prior work argues that refusal in large language models is mediated by a single activation-space direction, enabling effective steering and ablation. We show that this account is incomplete. Across eleven categories of refusal and non-compliance, including safety, incomplete or unsupported requests, anthropomorphization, and over-refusal, we find that these refusal behaviors correspond to geometrically distinct directions in activation space. Yet despite this diversity, linear steering along any refusal-related direction produces nearly identical refusal to over-refusal trade-offs, acting as a shared one-dimensional control knob. The primary effect of different directions is not whether the model refuses, but how it refuses.
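The steering and ablation operations mentioned in the abstract are usually written as simple vector arithmetic on a hidden state. The sketch below assumes the standard forms: steering adds a scaled unit direction to the activation, and ablation removes the activation's component along that direction. The names `steer`, `ablate`, `h`, `r_hat`, and `alpha` are illustrative, and the numbers are invented; the paper's exact intervention (layers, token positions, scaling) is not specified in this summary.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def steer(h, r_hat, alpha):
    """Linear steering: h' = h + alpha * r_hat (r_hat assumed unit-norm)."""
    return [x + alpha * r for x, r in zip(h, r_hat)]

def ablate(h, r_hat):
    """Directional ablation: h' = h - (h . r_hat) * r_hat."""
    coeff = dot(h, r_hat)
    return [x - coeff * r for x, r in zip(h, r_hat)]

# Toy hidden state and unit direction, made up for illustration only.
h = [0.5, -1.0, 2.0]
r_hat = [0.0, 1.0, 0.0]

steered = steer(h, r_hat, alpha=3.0)   # pushes the activation along r_hat
ablated = ablate(h, r_hat)             # removes the component along r_hat

print(steered)
print(dot(ablated, r_hat))
```

Here `alpha` plays the role of the one-dimensional control knob described in the abstract: sweeping it trades off refusal against over-refusal, whichever refusal-related direction is plugged in as `r_hat`.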
Taxonomy
Topics: Speech and dialogue systems · Topic Modeling · Neurobiology of Language and Bilingualism
