There Is More to Refusal in Large Language Models than a Single Direction

Faaiz Joad; Majd Hawasly; Sabri Boughorbel; Nadir Durrani; Husrev Taha Sencar

arXiv:2602.02132 · cs.CL · February 3, 2026

TL;DR

This paper shows that refusal behaviors in large language models correspond to multiple geometrically distinct directions in activation space, and that the choice of direction mainly influences how the model refuses rather than whether it refuses, challenging the idea of a single controlling direction.

Contribution

The paper demonstrates that refusal in language models involves multiple geometrically distinct directions across eleven categories of refusal and non-compliance, showing that the single-direction hypothesis is incomplete.

Findings

01

Refusal behaviors correspond to multiple distinct directions in activation space.

02

Linear steering along any refusal-related direction yields nearly identical trade-offs between refusal and over-refusal.

03

Different directions influence how the model refuses, not whether it refuses.

Abstract

Prior work argues that refusal in large language models is mediated by a single activation-space direction, enabling effective steering and ablation. We show that this account is incomplete. Across eleven categories of refusal and non-compliance, including safety, incomplete or unsupported requests, anthropomorphization, and over-refusal, we find that these refusal behaviors correspond to geometrically distinct directions in activation space. Yet despite this diversity, linear steering along any refusal-related direction produces nearly identical refusal to over-refusal trade-offs, acting as a shared one-dimensional control knob. The primary effect of different directions is not whether the model refuses, but how it refuses.
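
To make the terms "direction", "steering", and "ablation" concrete, here is a minimal sketch in plain Python. It rests on assumptions that the abstract does not confirm: directions are estimated as a normalized difference of mean activations (a common choice in prior work), the three-dimensional vectors are invented toy data, and every function name is hypothetical. It illustrates the general technique only, not the authors' pipeline.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]

def mean_vector(rows):
    # Column-wise mean of a list of equally sized activation vectors.
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def difference_of_means_direction(refused, complied):
    # ASSUMPTION: unit vector from mean "complied" to mean "refused" activations.
    mu_r = mean_vector(refused)
    mu_c = mean_vector(complied)
    return normalize([a - b for a, b in zip(mu_r, mu_c)])

def cosine(u, v):
    return dot(u, v) / math.sqrt(dot(u, u) * dot(v, v))

def steer(h, direction, alpha):
    # Linear steering: shift the activation by alpha along a unit direction.
    return [a + alpha * d for a, d in zip(h, direction)]

def ablate(h, direction):
    # Directional ablation: remove the component of h along a unit direction.
    coeff = dot(h, direction)
    return [a - coeff * d for a, d in zip(h, direction)]

# Toy activations (invented for illustration, not taken from the paper).
safety_refused = [[2.0, 0.0, 1.0], [2.0, 0.0, 3.0]]
unsupported_refused = [[0.0, 2.0, 1.0], [0.0, 2.0, 3.0]]
complied = [[0.0, 0.0, 1.0], [0.0, 0.0, 3.0]]

d_safety = difference_of_means_direction(safety_refused, complied)
d_unsupported = difference_of_means_direction(unsupported_refused, complied)

# A cosine similarity near zero means the two category directions are
# geometrically distinct in this toy setup.
similarity = cosine(d_safety, d_unsupported)

h = [0.5, 0.5, 1.0]
steered = steer(h, d_safety, alpha=2.0)
ablated = ablate(steered, d_safety)
```

In this toy setup the two category directions are orthogonal, so steering along one leaves the component along the other unchanged. The abstract's claim concerns behavior, not geometry: in the models studied, steering along any such direction is reported to give nearly the same trade-off between refusal and over-refusal, which this sketch does not attempt to reproduce.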

Taxonomy

Topics: Speech and dialogue systems · Topic Modeling · Neurobiology of Language and Bilingualism