What Drives Representation Steering? A Mechanistic Case Study on Steering Refusal
Stephen Cheng, Sarah Wiegreffe, Dinesh Manocha

TL;DR
This paper investigates how steering vectors change the behavior of large language models, using refusal as a case study. It identifies the internal circuits that steering acts on and shows that the vectors can be heavily sparsified without major performance loss.
Contribution
It introduces a multi-token activation patching framework and shows that steering vectors act primarily through the attention mechanism's OV circuit rather than its QK circuit, yielding both an interpretable account of steering and insights into sparsification.
Findings
Steering vectors mainly interact with the attention OV circuit.
Freezing all attention scores during steering reduces performance by only 8.75% across two model families.
Steering vectors can be sparsified by 90-99% without major performance loss.
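To make the OV/QK distinction in these findings concrete, here is a minimal toy sketch of a single causal attention head in NumPy. It is not the paper's experimental setup: the dimensions, random weights, the steering vector `v`, the scale `alpha`, and the helper names `attention_pattern` and `head_output` are all hypothetical, and layer normalization is omitted. The sketch adds the same vector to every token's residual stream and compares the head output when the attention pattern is recomputed versus frozen at its unsteered values. With the pattern frozen, each row of the pattern sums to 1, so the steering contribution reduces to a fixed shift through the value and output matrices alone.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_pattern(X, W_Q, W_K):
    """Causal attention pattern for one head (the QK circuit)."""
    T = X.shape[0]
    scores = (X @ W_Q) @ (X @ W_K).T / np.sqrt(W_Q.shape[1])
    future = np.triu(np.ones((T, T), dtype=bool), k=1)  # positions j > i
    scores = np.where(future, -np.inf, scores)
    return softmax(scores, axis=-1)

def head_output(X, A, W_V, W_O):
    """Head output for a given attention pattern A (the OV circuit)."""
    return A @ (X @ W_V) @ W_O

# Toy sizes and random weights: assumptions for illustration only.
rng = np.random.default_rng(0)
T, d_model, d_head = 5, 16, 4
X = rng.normal(size=(T, d_model))
W_Q, W_K, W_V = (rng.normal(size=(d_model, d_head)) for _ in range(3))
W_O = rng.normal(size=(d_head, d_model))

v = rng.normal(size=d_model)   # hypothetical steering vector
alpha = 2.0                    # hypothetical steering strength
X_steered = X + alpha * v      # same vector added at every token

A_clean = attention_pattern(X, W_Q, W_K)
A_steered = attention_pattern(X_steered, W_Q, W_K)

out_clean = head_output(X, A_clean, W_V, W_O)
out_full = head_output(X_steered, A_steered, W_V, W_O)   # QK and OV both see v
out_frozen = head_output(X_steered, A_clean, W_V, W_O)   # only OV sees v

# With frozen attention, the steering effect is a constant OV-path shift.
ov_shift = alpha * v @ W_V @ W_O
print("frozen effect equals OV shift:",
      np.allclose(out_frozen - out_clean, ov_shift))
print("mean |full - frozen|:", np.abs(out_full - out_frozen).mean())
```

In this toy head the gap between `out_full` and `out_frozen` is whatever the QK circuit contributes. With random weights that gap says nothing about real models; the reported 8.75% figure is an empirical result measured on trained LLMs.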
Abstract
Applying steering vectors to large language models (LLMs) is an efficient and effective model alignment technique, but we lack an interpretable explanation for how it works: specifically, what internal mechanisms steering vectors affect and how this results in different model outputs. To investigate the causal mechanisms underlying the effectiveness of steering vectors, we conduct a comprehensive case study on refusal. We propose a multi-token activation patching framework and discover that different steering methodologies leverage functionally interchangeable circuits when applied at the same layer. These circuits reveal that steering vectors primarily interact with the attention mechanism through the OV circuit while largely ignoring the QK circuit; freezing all attention scores during steering drops performance by only 8.75% across two model families. A mathematical decomposition…
