Weight Updates as Activation Shifts: A Principled Framework for Steering
Dyah Adila, John Cooper, Alexander Yun, Avi Trost, Frederic Sala

TL;DR
This paper establishes a theoretical framework linking activation steering and weight updates, enabling more efficient model adaptation that outperforms prior methods with minimal parameter changes.
Contribution
The paper introduces a principled, theoretically grounded framework for activation steering, proposes joint adaptation in activation and weight spaces, and reports gains in both parameter efficiency and performance.
Findings
Post-block steering achieves accuracy close to full fine-tuning using only 0.04% of the parameters.
Joint adaptation surpasses isolated weight or activation updates.
Theoretical equivalence guides optimal intervention locations.
Abstract
Activation steering promises to be an extremely parameter-efficient form of adaptation, but its effectiveness depends on critical design choices -- such as intervention location and parameterization -- that currently rely on empirical heuristics rather than a principled foundation. We establish a first-order equivalence between activation-space interventions and weight-space updates, deriving the conditions under which activation steering can replicate fine-tuning behavior. This equivalence yields a principled framework for steering design and identifies the post-block output as a theoretically-backed and highly expressive intervention site. We further explain why certain intervention locations outperform others and show that weight updates and activation updates play distinct, complementary functional roles. This analysis motivates a new approach -- joint adaptation -- that trains in…
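The first-order equivalence mentioned in the abstract can be illustrated numerically on a toy layer. The sketch below is not the paper's derivation or code: the tanh layer, the function names (`layer`, `first_order_shift`, `approximation_error`), and the dimensions are all assumptions chosen for illustration. It compares the output after a small weight update with the output of the frozen layer plus the activation shift predicted by a first-order Taylor expansion. Under these assumptions, the gap between the two should shrink roughly quadratically as the update gets smaller.

```python
import numpy as np

def layer(W, x):
    """Toy layer (an assumption for illustration): tanh of a linear map."""
    return np.tanh(W @ x)

def first_order_shift(W, dW, x):
    """Activation shift predicted by a first-order Taylor expansion in dW."""
    pre = W @ x
    # Derivative of tanh is 1 - tanh^2, applied elementwise to (dW @ x).
    return (1.0 - np.tanh(pre) ** 2) * (dW @ x)

def approximation_error(scale, seed=0, dim=8):
    """Max gap between the updated layer and the frozen layer plus a shift."""
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((dim, dim)) / np.sqrt(dim)
    dW = scale * rng.standard_normal((dim, dim))
    x = rng.standard_normal(dim)
    exact = layer(W + dW, x)                             # weight-space update
    steered = layer(W, x) + first_order_shift(W, dW, x)  # activation-space shift
    return float(np.max(np.abs(exact - steered)))

if __name__ == "__main__":
    for scale in (1e-1, 1e-2, 1e-3):
        print(f"scale={scale:g}  max error={approximation_error(scale):.3e}")
```

Note that the shift depends on the input `x`, so a single fixed steering vector would match the weight update only approximately across inputs.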
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Robot Manipulation and Learning · Interactive and Immersive Displays
