Activation Scaling for Steering and Interpreting Language Models
Niklas Stoehr, Kevin Du, Vésteinn Snæbjarnarson, Robert West, Ryan Cotterell, Aaron Schein

TL;DR
This paper introduces activation scaling, a minimal and interpretable way to steer language models: it multiplies a few relevant activation vectors by learned scalars, changing only their signed magnitude. Because the intervention is sparse, it also helps interpret the model's internal workings.
Contribution
The paper proposes activation scaling, a steering intervention learned by gradient-based optimization of a three-term objective (effectiveness, faithfulness, minimality), as a more minimal and interpretable alternative to steering vectors.
Findings
Activation scaling performs comparably to steering vectors in effectiveness.
Activation scaling is more minimal than steering vectors, which aids interpretability.
Activation scalars can be learned as functions of activation vectors.
Abstract
Given the prompt "Rome is in", can we steer a language model to flip its prediction of an incorrect token "France" to a correct token "Italy" by only multiplying a few relevant activation vectors with scalars? We argue that successfully intervening on a model is a prerequisite for interpreting its internal workings. Concretely, we establish a three-term objective: a successful intervention should flip the correct with the wrong token and vice versa (effectiveness), and leave other tokens unaffected (faithfulness), all while being sparse (minimality). Using gradient-based optimization, this objective lets us learn (and later evaluate) a specific kind of efficient and interpretable intervention: activation scaling only modifies the signed magnitude of activation vectors to strengthen, weaken, or reverse the steering directions already encoded in the model. On synthetic tasks, this…
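The three-term objective described in the abstract can be illustrated with a minimal toy sketch. Everything concrete here is an assumption made for illustration, not the authors' implementation: the three-token vocabulary, the hand-written activation vectors, the identity unembedding, the `(1 + alpha)` parametrization of the scalars, the hinge form of the effectiveness term, the squared-logit-change form of the faithfulness term, and the L1 form of the minimality term. Finite differences stand in for automatic differentiation so that the example needs only the standard library.

```python
# Toy setup (hypothetical): token 0 is correct, token 1 is the wrong prediction.
TOKENS = ["Italy", "France", "Spain"]
ACTIVATIONS = [
    [1.0, 0.0, 0.0],  # component pushing towards "Italy"
    [0.0, 2.0, 0.0],  # component pushing towards "France"
    [0.0, 0.0, 0.5],  # component pushing towards "Spain"
]
UNEMBED = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]  # identity


def logits_after_scaling(activations, unembed, alphas):
    """Scale each activation vector by (1 + alpha_i), sum them, and unembed."""
    dim = len(activations[0])
    residual = [
        sum((1.0 + alpha) * vec[d] for alpha, vec in zip(alphas, activations))
        for d in range(dim)
    ]
    return [sum(w * r for w, r in zip(row, residual)) for row in unembed]


def objective(alphas, activations, unembed, correct, wrong, margin=1.0, lam=0.1):
    """Assumed loss: effectiveness + faithfulness + lam * minimality."""
    base = logits_after_scaling(activations, unembed, [0.0] * len(alphas))
    new = logits_after_scaling(activations, unembed, alphas)
    # Effectiveness: the correct token should beat the wrong one by a margin.
    effectiveness = max(0.0, margin - (new[correct] - new[wrong]))
    # Faithfulness: logits of all other tokens should stay where they were.
    faithfulness = sum(
        (new[t] - base[t]) ** 2
        for t in range(len(new))
        if t not in (correct, wrong)
    )
    # Minimality: L1 penalty that favours leaving most scalars at zero.
    minimality = sum(abs(a) for a in alphas)
    return effectiveness + faithfulness + lam * minimality


def learn_scalars(activations, unembed, correct, wrong, steps=500, lr=0.01, eps=1e-4):
    """Gradient descent on the scalars using central finite differences."""
    alphas = [0.0] * len(activations)
    for _ in range(steps):
        grads = []
        for i in range(len(alphas)):
            up = alphas[:i] + [alphas[i] + eps] + alphas[i + 1:]
            down = alphas[:i] + [alphas[i] - eps] + alphas[i + 1:]
            diff = objective(up, activations, unembed, correct, wrong) - objective(
                down, activations, unembed, correct, wrong
            )
            grads.append(diff / (2 * eps))
        alphas = [a - lr * g for a, g in zip(alphas, grads)]
    return alphas


def predict(alphas):
    """Return the top token under the given scalars."""
    logits = logits_after_scaling(ACTIVATIONS, UNEMBED, alphas)
    return TOKENS[logits.index(max(logits))]


alphas = learn_scalars(ACTIVATIONS, UNEMBED, correct=0, wrong=1)
print("before:", predict([0.0, 0.0, 0.0]))
print("after: ", predict(alphas))
print("scalars:", [round(a, 3) for a in alphas])
```

In this toy, the learned scalar for the "France" component becomes negative (weakening that direction), while the scalar for the unrelated "Spain" component stays at essentially zero, which is the kind of sparsity the minimality term is meant to encourage.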
Taxonomy
Topics: Natural Language Processing Techniques · Speech and dialogue systems
Methods: FLIP
