Activation Scaling for Steering and Interpreting Language Models

Niklas Stoehr; Kevin Du; Vésteinn Snæbjarnarson; Robert West; Ryan Cotterell; Aaron Schein

arXiv:2410.04962·cs.CL·October 8, 2024

Open Access

TL;DR

This paper introduces activation scaling, a minimal and interpretable method for steering language models that only rescales the signed magnitude of a few activation vectors, on the premise that successfully intervening on a model is a prerequisite for interpreting its internals.

Contribution

The paper proposes activation scaling, a steering technique that offers a more minimal and more interpretable alternative to existing methods such as steering vectors.

Findings

01 Activation scaling performs comparably to steering vectors in effectiveness.

02 Activation scaling is more minimal than steering vectors, which aids interpretability.

03 Activation scalars can be learned as functions of the activation vectors themselves.

Abstract

Given the prompt "Rome is in", can we steer a language model to flip its prediction of an incorrect token "France" to a correct token "Italy" by only multiplying a few relevant activation vectors with scalars? We argue that successfully intervening on a model is a prerequisite for interpreting its internal workings. Concretely, we establish a three-term objective: a successful intervention should flip the correct with the wrong token and vice versa (effectiveness), and leave other tokens unaffected (faithfulness), all while being sparse (minimality). Using gradient-based optimization, this objective lets us learn (and later evaluate) a specific kind of efficient and interpretable intervention: activation scaling only modifies the signed magnitude of activation vectors to strengthen, weaken, or reverse the steering directions already encoded in the model. On synthetic tasks, this…
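The three-term objective described in the abstract can be illustrated with a toy sketch. Everything below is an assumption made for illustration and is not taken from the paper: the four-token vocabulary, the three hand-picked activation vectors, the identity unembedding, the `(1 + alpha)` parametrization of the scalars, the hinge, squared-error, and L1 forms of the three terms, and the finite-difference gradient standing in for automatic differentiation.

```python
import numpy as np

# Hypothetical toy setup: one activation vector per model site, and an
# identity unembedding so that logits are just the sum of scaled vectors.
VOCAB = ["France", "Italy", "Spain", "the"]
WRONG, CORRECT = 0, 1          # token indices to flip
OTHERS = [2, 3]                # tokens that should stay unaffected
ACTS = np.array([
    [2.0, 0.5, 0.1, 0.0],      # site 0: mostly promotes "France"
    [0.0, 1.0, 0.0, 0.3],      # site 1: mostly promotes "Italy"
    [0.1, 0.1, 0.5, 0.5],      # site 2: unrelated to the task
])


def logits(alpha):
    """Scale each activation vector by (1 + alpha_i), then sum over sites."""
    return ((1.0 + alpha)[:, None] * ACTS).sum(axis=0)


BASE = logits(np.zeros(len(ACTS)))  # logits of the unmodified toy model


def objective(alpha, margin=1.0, lam=0.05):
    """Assumed three-term loss: effectiveness + faithfulness + minimality."""
    z = logits(alpha)
    # Effectiveness: the correct token should beat the wrong one by a margin.
    effectiveness = max(0.0, margin - (z[CORRECT] - z[WRONG]))
    # Faithfulness: logits of all other tokens should not move.
    faithfulness = np.sum((z[OTHERS] - BASE[OTHERS]) ** 2)
    # Minimality: an L1 penalty pushes scalars toward 0 (no intervention).
    minimality = lam * np.sum(np.abs(alpha))
    return effectiveness + faithfulness + minimality


def numerical_grad(f, x, eps=1e-5):
    """Central-difference gradient; a stand-in for autodiff in this sketch."""
    grad = np.zeros_like(x)
    for i in range(len(x)):
        step = np.zeros_like(x)
        step[i] = eps
        grad[i] = (f(x + step) - f(x - step)) / (2 * eps)
    return grad


# Plain gradient descent on the scalars, starting from "no intervention".
alpha = np.zeros(len(ACTS))
for _ in range(500):
    alpha -= 0.05 * numerical_grad(objective, alpha)

steered = logits(alpha)
print("before:", VOCAB[int(np.argmax(BASE))])
print("after: ", VOCAB[int(np.argmax(steered))])
print("learned scalars:", np.round(alpha, 2))
```

In this toy setting the learned scalars can be read directly: a negative value weakens a site's contribution, a positive value strengthens it, and a value near zero marks a site the intervention leaves essentially untouched, which is the sense in which a sparse set of scalars is easier to interpret than a dense added vector.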

Taxonomy

Topics: Natural Language Processing Techniques · Speech and dialogue systems

Methods: FLIP