What Drives Representation Steering? A Mechanistic Case Study on Steering Refusal

Stephen Cheng; Sarah Wiegreffe; Dinesh Manocha

arXiv:2604.08524·cs.LG·April 10, 2026

TL;DR

This paper investigates how steering vectors change the behavior of large language models, using refusal as a case study. It identifies the internal circuits the vectors act through and shows that the vectors can be sparsified by 90-99% without major performance loss.

Contribution

The paper introduces a multi-token activation patching framework and shows that steering vectors interact with the attention mechanism primarily through the OV circuit rather than the QK circuit, which yields both an interpretable account of steering and a basis for sparsifying the vectors.

Findings

01. Steering vectors mainly interact with the attention OV circuit and largely ignore the QK circuit.

02. Freezing all attention scores during steering reduces performance by only 8.75% across two model families.

03. Steering vectors can be sparsified by 90-99% without major performance loss.
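To make the sparsity figures concrete, here is a minimal sketch of what zeroing 90% of a steering vector's entries looks like. This summary does not say how the paper selects which entries to keep, so the magnitude-based top-k rule, the function name `sparsify_top_k`, the dimension 4096, and the random stand-in vector are all assumptions made for illustration.

```python
import numpy as np

def sparsify_top_k(vector, keep_fraction):
    """Keep the largest-magnitude entries and zero the rest.

    Assumption: magnitude-based selection is an illustrative choice,
    not necessarily the criterion used in the paper.
    """
    k = max(1, int(round(keep_fraction * vector.size)))
    # Indices of the k entries with the largest absolute value.
    keep = np.argpartition(np.abs(vector), -k)[-k:]
    sparse = np.zeros_like(vector)
    sparse[keep] = vector[keep]
    return sparse

rng = np.random.default_rng(0)
steering_vector = rng.normal(size=4096)   # hypothetical stand-in vector

# Keeping 10% of the entries corresponds to roughly 90% sparsity.
sparse_vector = sparsify_top_k(steering_vector, keep_fraction=0.10)
sparsity = 1.0 - np.count_nonzero(sparse_vector) / sparse_vector.size
```

Whether a vector sparsified this way still steers the model is an empirical question that this sketch does not answer; it only shows the bookkeeping behind a sparsity percentage.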

Abstract

Applying steering vectors to large language models (LLMs) is an efficient and effective model alignment technique, but we lack an interpretable explanation for how it works: specifically, what internal mechanisms steering vectors affect and how this results in different model outputs. To investigate the causal mechanisms underlying the effectiveness of steering vectors, we conduct a comprehensive case study on refusal. We propose a multi-token activation patching framework and discover that different steering methodologies leverage functionally interchangeable circuits when applied at the same layer. These circuits reveal that steering vectors primarily interact with the attention mechanism through the OV circuit while largely ignoring the QK circuit: freezing all attention scores during steering drops performance by only 8.75% across two model families. A mathematical decomposition…
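The OV/QK distinction can be illustrated with a toy single attention head. The sketch below is not the paper's patching framework. It assumes a steering vector `v` added to the residual stream at every position, random weights, and no layer normalization. Under those assumptions, freezing the attention pattern from the unsteered run isolates the OV path, and because each row of the pattern sums to 1, the head's output shifts by exactly `v @ W_V @ W_O` at every position.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_pattern(X, W_Q, W_K):
    """Causal attention pattern (QK circuit) for one head."""
    T = X.shape[0]
    scores = (X @ W_Q) @ (X @ W_K).T / np.sqrt(W_Q.shape[1])
    future = np.triu(np.ones((T, T), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    return softmax(scores, axis=-1)

def head_output(X, pattern, W_V, W_O):
    """Mix value vectors with a given pattern, then project (OV circuit)."""
    return pattern @ (X @ W_V) @ W_O

# Toy sizes and random weights: all hypothetical, for illustration only.
rng = np.random.default_rng(0)
T, d_model, d_head = 5, 16, 4
X = rng.normal(size=(T, d_model))
W_Q, W_K, W_V = (rng.normal(size=(d_model, d_head)) for _ in range(3))
W_O = rng.normal(size=(d_head, d_model))
v = rng.normal(size=d_model)          # stand-in steering vector

X_steered = X + v                     # assumption: steer every position
clean_pattern = attention_pattern(X, W_Q, W_K)
steered_pattern = attention_pattern(X_steered, W_Q, W_K)

out_clean = head_output(X, clean_pattern, W_V, W_O)
out_full = head_output(X_steered, steered_pattern, W_V, W_O)    # QK and OV both see v
out_frozen = head_output(X_steered, clean_pattern, W_V, W_O)    # only OV sees v

# With the pattern frozen, the steering effect is a fixed OV-projected shift.
ov_shift = v @ W_V @ W_O
frozen_delta = out_frozen - out_clean
qk_contribution = out_full - out_frozen   # what recomputing attention scores adds
```

In this toy setting `qk_contribution` is generally nonzero, since the weights are random. The paper's empirical claim is that in the models it studies, discarding this part costs only 8.75% in performance.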
