Activation Steering via Generative Causal Mediation

Aruna Sankaranarayanan; Amir Zur; Atticus Geiger; Dylan Hadfield-Menell

arXiv:2602.16080 · cs.CL · April 2, 2026

TL;DR

This paper introduces Generative Causal Mediation (GCM), a method for identifying the model components (e.g., attention heads) that mediate a long-form language model behavior and steering those components to control it. GCM outperforms correlational probe-based baselines when steering with a sparse set of attention heads.

Contribution

GCM is a novel procedure that localizes and controls diffuse behaviors in language models by selecting influential model components based on contrastive response data.

Findings

01. GCM effectively localizes concepts in long-form responses.

02. GCM outperforms correlation-based baselines in steering tasks.

03. GCM successfully controls behaviors like refusal, sycophancy, and style transfer.

Abstract

Where should we intervene in a language model (LM) to localize and control behaviors that are diffused across many tokens of a long-form response? We introduce Generative Causal Mediation (GCM), a procedure for selecting model components (e.g., attention heads) from contrastive long-form responses, to steer such diffuse concepts (e.g., talk in verse vs. talk in prose). In GCM, we first construct a dataset of contrasting behavioral inputs and long-form responses. Then, we quantify how model components mediate the concept and select the strongest mediators for steering. We evaluate GCM on three behaviors--refusal, sycophancy, and style transfer--across three language models. GCM successfully localizes concepts expressed in long-form responses and outperforms correlational probe-based baselines when steering with a sparse set of attention heads. Together, these results demonstrate that GCM…
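The selection step described in the abstract (score each component by how strongly it mediates the concept, then keep the strongest mediators) can be illustrated with a toy activation-patching sketch. This is not the paper's implementation: the scalar "activations", the linear `score` function standing in for the likelihood of the contrasting long-form response, and the names `indirect_effect` and `select_mediators` are all assumptions made for illustration.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Head = Tuple[int, int]      # (layer, head index); hypothetical identifier
Acts = Dict[Head, float]    # toy stand-in for per-head activations
ScoreFn = Callable[[Acts], float]


def indirect_effect(score_fn: ScoreFn, base: Acts, contrast: Acts, head: Head) -> float:
    """Change in score when one head's activation is patched in from the contrast run."""
    patched = dict(base)
    patched[head] = contrast[head]
    return score_fn(patched) - score_fn(base)


def select_mediators(
    score_fn: ScoreFn,
    pairs: Sequence[Tuple[Acts, Acts]],
    heads: Sequence[Head],
    k: int,
) -> List[Head]:
    """Rank heads by mean indirect effect over contrastive pairs and keep the top k."""
    mean_effect = {
        h: sum(indirect_effect(score_fn, b, c, h) for b, c in pairs) / len(pairs)
        for h in heads
    }
    return sorted(heads, key=lambda h: mean_effect[h], reverse=True)[:k]


# Toy setup (assumed): three heads whose activations combine linearly into a score.
# In a real model the score would come from a forward pass, not a weighted sum.
HEADS: List[Head] = [(0, 0), (0, 1), (1, 0)]
WEIGHTS: Dict[Head, float] = {(0, 0): 2.0, (0, 1): 0.1, (1, 0): -1.0}


def score(acts: Acts) -> float:
    return sum(WEIGHTS[h] * acts[h] for h in HEADS)


# Each pair holds (base activations, contrast activations) for one contrastive example.
pairs = [
    ({h: 0.0 for h in HEADS}, {h: 1.0 for h in HEADS}),
    ({h: 0.5 for h in HEADS}, {h: 1.0 for h in HEADS}),
]

top = select_mediators(score, pairs, HEADS, k=2)
print(top)
```

In this toy setting the head with the largest positive weight ranks first, and the head that pushes the score away from the contrast behavior is excluded. Steering would then intervene only on the selected heads.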
