From Weights to Activations: Is Steering the Next Frontier of Adaptation?
Simon Ostermann, Daniil Gurgurov, Tanja Baeumel, Michael A. Hedderich, Sebastian Lapuschkin, Wojciech Samek, Vera Schmitt

TL;DR
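This paper argues that steering, the modification of internal activations at inference time, should be treated as a form of model adaptation. It compares steering with established methods using a set of functional criteria and positions it as a distinct paradigm based on targeted activation-space interventions that enable local, reversible behavioral change.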
Contribution
It introduces a set of functional criteria for adaptation methods and uses them to position steering relative to fine-tuning, parameter-efficient adaptation, and prompting.
Findings
Steering is characterized as a distinct, reversible adaptation paradigm.
A unified taxonomy for model adaptation is proposed.
Steering enables local behavioral changes without parameter updates.
Abstract
Post-training adaptation of language models is commonly achieved through parameter updates or input-based methods such as fine-tuning, parameter-efficient adaptation, and prompting. In parallel, a growing body of work modifies internal activations at inference time to influence model behavior, an approach known as steering. Despite increasing use, steering is rarely analyzed within the same conceptual framework as established adaptation methods. In this work, we argue that steering should be regarded as a form of model adaptation. We introduce a set of functional criteria for adaptation methods and use them to compare steering approaches with classical alternatives. This analysis positions steering as a distinct adaptation paradigm based on targeted interventions in activation space, enabling local and reversible behavioral change without parameter updates. The resulting framing…
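To make the idea of an activation-space intervention concrete, here is a minimal sketch in plain Python. It is not the authors' method: `TinyModel`, its weights, the `steering` vector, and the `alpha` scale are all hypothetical stand-ins chosen for illustration. The sketch adds a fixed vector to a hidden activation during the forward pass and checks that the change is reversible and leaves the parameters untouched.

```python
import math


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


class TinyModel:
    """Toy two-layer network standing in for a language model (hypothetical)."""

    def __init__(self):
        self.W1 = [[0.5, -0.2], [0.1, 0.8], [-0.3, 0.4]]  # 2 inputs -> 3 hidden units
        self.W2 = [[1.0, -1.0, 0.5]]                      # 3 hidden units -> 1 output

    def forward(self, x, steering=None, alpha=1.0):
        # Hidden activations of the unmodified model.
        h = [math.tanh(v) for v in matvec(self.W1, x)]
        # Steering: shift the activations at inference time; weights are not touched.
        if steering is not None:
            h = [hi + alpha * si for hi, si in zip(h, steering)]
        return matvec(self.W2, h)[0]


model = TinyModel()
x = [1.0, 2.0]
weights_before = ([row[:] for row in model.W1], [row[:] for row in model.W2])

baseline = model.forward(x)
steered = model.forward(x, steering=[0.0, 0.0, 1.0], alpha=2.0)
restored = model.forward(x)  # dropping the steering vector restores the original behavior

print("baseline:", baseline)
print("steered: ", steered)
print("restored:", restored)
print("weights unchanged:", (model.W1, model.W2) == weights_before)
```

In this toy setting the steered output differs from the baseline only through the added vector, and removing the vector recovers the baseline exactly, which is the sense of "reversible" and "without parameter updates" used in the abstract. In a real transformer the same shift would typically be applied to a chosen layer's hidden states, for example through a forward hook.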
