Over-Refusal and Representation Subspaces: A Mechanistic Analysis of Task-Conditioned Refusal in Aligned LLMs
Utsav Maskey, Mark Dras, Usman Naseem

TL;DR
This paper investigates the representational geometry of refusal in aligned language models, showing that harmful-refusal and over-refusal directions are geometrically distinct and that over-refusal calls for task-specific interventions.
Contribution
The paper provides a mechanistic analysis showing that harmful-refusal and over-refusal directions are geometrically distinct, which motivates task-specific interventions rather than a single global one.
Findings
Harmful-refusal directions are task-agnostic and captured by a single vector.
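Over-refusal directions are task-dependent: they reside within the benign task-representation clusters, vary across tasks, and span a higher-dimensional subspace.
Linear probing shows that the two refusal types are representationally distinct from the early transformer layers.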
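One way to picture the contrast between a single global vector and a higher-dimensional subspace is to stack per-task directions and count how many singular components are needed to capture most of their energy. The sketch below is a toy illustration on synthetic vectors, not the paper's procedure or data: the helper `components_needed`, the 0.9 threshold, the dimensions, and the noise scale are all assumptions chosen for the example.

```python
import numpy as np

def unit(v):
    """Scale a vector to unit length."""
    return v / np.linalg.norm(v)

def components_needed(directions, threshold=0.9):
    """Count the singular components needed to reach `threshold` of the
    total squared singular-value energy of the stacked directions."""
    s = np.linalg.svd(directions, compute_uv=False)
    energy = np.cumsum(s**2) / np.sum(s**2)
    return int(np.searchsorted(energy, threshold) + 1)

rng = np.random.default_rng(0)
n_tasks, dim = 6, 32  # hypothetical sizes

# Synthetic "task-agnostic" case: every task shares one direction plus small noise.
shared = unit(rng.normal(size=dim))
harmful_dirs = np.stack(
    [unit(shared + 0.01 * rng.normal(size=dim)) for _ in range(n_tasks)]
)

# Synthetic "task-dependent" case: each task has its own independent direction.
over_dirs = np.stack([unit(rng.normal(size=dim)) for _ in range(n_tasks)])

harmful_k = components_needed(harmful_dirs)
over_k = components_needed(over_dirs)
print("components for shared-direction set:", harmful_k)
print("components for per-task set:", over_k)
```

In this toy setup the shared-direction set collapses to a single dominant component, while the per-task set needs several. Real hidden states would have to be extracted from a model before any such comparison says anything about actual refusal behaviour.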
Abstract
Aligned language models that are trained to refuse harmful requests also exhibit over-refusal: they decline safe instructions that superficially resemble harmful ones. A natural approach is to ablate the global refusal direction, steering the hidden-state vectors away from or towards the harmful-refusal examples, but this corrects over-refusal only incidentally while disrupting the broader refusal mechanism. In this work, we analyse the representational geometry of both refusal types to understand why this happens. We show that harmful-refusal directions are task-agnostic and can be captured by a single global vector, whereas over-refusal directions are task-dependent: they reside within the benign task-representation clusters, vary across tasks, and span a higher-dimensional subspace. Linear probing confirms that the two refusal types are representationally distinct from the early…
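The abstract's "ablate the global refusal direction" can be sketched with a common recipe: estimate the direction as a difference of means between hidden states of refused and complied prompts, then project it out. This is a minimal sketch under stated assumptions, not the authors' implementation: the hidden states here are synthetic Gaussians, and the names `refusal_direction` and `ablate` are hypothetical.

```python
import numpy as np

def refusal_direction(refused, complied):
    """Unit difference-of-means direction between two sets of hidden states."""
    d = refused.mean(axis=0) - complied.mean(axis=0)
    return d / np.linalg.norm(d)

def ablate(hidden, direction):
    """Remove the component of each hidden state along `direction`."""
    return hidden - np.outer(hidden @ direction, direction)

rng = np.random.default_rng(0)
dim = 16  # hypothetical hidden size

# Synthetic stand-ins for hidden states: "refused" examples are shifted
# along one known axis so the recovered direction can be checked.
true_dir = np.zeros(dim)
true_dir[0] = 1.0
complied = rng.normal(size=(200, dim))
refused = rng.normal(size=(200, dim)) + 4.0 * true_dir

direction = refusal_direction(refused, complied)
ablated = ablate(refused, direction)

print("cosine with planted axis:", float(direction @ true_dir))
print("max residual along direction:", float(np.abs(ablated @ direction).max()))
```

Because this projection removes one global direction from every hidden state regardless of task, it matches the abstract's point: if over-refusal varies by task and spans several dimensions, a single-vector ablation can only address it incidentally.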
