Activation Steering via Generative Causal Mediation
Aruna Sankaranarayanan, Amir Zur, Atticus Geiger, Dylan Hadfield-Menell

TL;DR
This paper introduces Generative Causal Mediation (GCM), a method for identifying and steering the model components that mediate long-form language model behaviors. It outperforms correlational probe-based baselines.
Contribution
GCM is a novel procedure that localizes and controls diffuse behaviors in language models by selecting influential model components based on contrastive response data.
Findings
GCM effectively localizes concepts in long-form responses.
GCM outperforms correlational probe-based baselines when steering with a sparse set of attention heads.
GCM successfully controls behaviors like refusal, sycophancy, and style transfer.
Abstract
Where should we intervene in a language model (LM) to localize and control behaviors that are diffused across many tokens of a long-form response? We introduce Generative Causal Mediation (GCM), a procedure for selecting model components (e.g., attention heads) from contrastive long-form responses, to steer such diffuse concepts (e.g., talk in verse vs. talk in prose). In GCM, we first construct a dataset of contrasting behavioral inputs and long-form responses. Then, we quantify how model components mediate the concept and select the strongest mediators for steering. We evaluate GCM on three behaviors--refusal, sycophancy, and style transfer--across three language models. GCM successfully localizes concepts expressed in long-form responses and outperforms correlational probe-based baselines when steering with a sparse set of attention heads. Together, these results demonstrate that GCM…
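The abstract's pipeline (build contrastive pairs, score how strongly each component mediates the concept, keep the strongest mediators) can be illustrated with a small toy. This is only a sketch under loud assumptions, not the paper's estimator: `toy_model`, `indirect_effects`, and `select_top_k` are hypothetical names, the "model" is a fixed readout over eight scalar "head" activations rather than an LM, and the mediation score used here is the generic activation-patching effect (swap one head's activation from the contrast run into the base run and measure the change in a concept score).

```python
import numpy as np

def toy_model(head_acts, readout):
    """Scalar 'concept score' from per-head activations (toy stand-in for an LM)."""
    return float(np.tanh(head_acts) @ readout)

def indirect_effects(base_acts, contrast_acts, readout):
    """Patch one head at a time from the contrast run into the base run
    and record how much the concept score changes."""
    base_score = toy_model(base_acts, readout)
    effects = np.zeros(len(base_acts))
    for h in range(len(base_acts)):
        patched = base_acts.copy()
        patched[h] = contrast_acts[h]  # intervene on head h only
        effects[h] = toy_model(patched, readout) - base_score
    return effects

def select_top_k(effects, k):
    """Indices of the k heads with the largest absolute effect, in ascending order."""
    order = np.argsort(-np.abs(effects))
    return sorted(order[:k].tolist())

# Assumed toy setup: only heads 2 and 5 influence the concept score.
n_heads = 8
readout = np.zeros(n_heads)
readout[[2, 5]] = [1.0, -1.5]

base_acts = np.linspace(-1.0, 1.0, n_heads)  # activations on the base input
contrast_acts = base_acts + 1.0              # activations on the contrasting input

effects = indirect_effects(base_acts, contrast_acts, readout)
top_heads = select_top_k(effects, k=2)
print(top_heads)
```

In this toy, heads with a zero readout weight get an effect of exactly zero, so the selection recovers the two heads that actually carry the concept. A correlational probe could instead pick any head whose activation merely co-varies with the contrast, which is the distinction the abstract draws.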
