Steer Like the LLM: Activation Steering that Mimics Prompting
Geert Heyman, Frederik Vandeputte

TL;DR
This paper introduces Prompt Steering Replacement (PSR), an activation steering method that imitates prompt-based steering by estimating token-specific steering coefficients from the activations themselves, and that outperforms existing activation steering techniques.
Contribution
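The paper formulates prompt steering as a form of activation steering, shows that popular activation steering methods do not reproduce how prompts steer (strong interventions on some tokens, barely any on others), and proposes PSR models that outperform current activation steering techniques.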
Findings
PSR models outperform existing activation steering methods.
PSR achieves comparable or better results than prompting on benchmarks.
Activation steering methods often do not faithfully replicate prompt mechanics.
Abstract
Large language models can be steered at inference time through prompting or activation interventions, but activation steering methods often underperform compared to prompt-based approaches. We propose a framework that formulates prompt steering as a form of activation steering and investigates whether distilling successful prompt steering behavior into simpler, interpretable models can close this gap. Our analysis reveals that popular activation steering methods are not faithful to the mechanics of prompt steering, which applies strong interventions on some tokens while barely affecting others. Based on these insights, we introduce Prompt Steering Replacement (PSR) models that estimate token-specific steering coefficients from the activations themselves and are trained to imitate prompt-based interventions. Experiments on three steering benchmarks across multiple language models show…
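The contrast the abstract draws between uniform and token-specific interventions can be sketched in a few lines of NumPy. This is an illustrative toy under stated assumptions, not the authors' implementation: the linear coefficient model (`h @ w + b`), the least-squares fit, the single steering direction `v`, and all function names are hypothetical choices made here for clarity, and the "prompted" activations are synthetic stand-ins rather than outputs of a real language model.

```python
import numpy as np

def uniform_steering(h, v, alpha):
    """Conventional steering: add the same scaled vector to every token."""
    return h + alpha * v

def token_specific_steering(h, v, w, b):
    """Assumed PSR-style steering: one coefficient per token, read from its activation."""
    c = h @ w + b                      # shape (n_tokens,)
    return h + c[:, None] * v          # scale the steering vector per token

def fit_coefficient_model(h_plain, h_prompted, v):
    """Fit (w, b) so steering imitates the shift a prompt induces (hypothetical recipe)."""
    delta = h_prompted - h_plain       # per-token shift caused by the prompt
    targets = delta @ v / (v @ v)      # projection of each shift onto v
    X = np.hstack([h_plain, np.ones((h_plain.shape[0], 1))])  # append bias column
    params, *_ = np.linalg.lstsq(X, targets, rcond=None)
    return params[:-1], params[-1]

rng = np.random.default_rng(0)
n_tokens, d_model = 64, 8
h_plain = rng.normal(size=(n_tokens, d_model))   # activations without the prompt
v = rng.normal(size=d_model)                     # steering direction (assumed given)

# Synthetic "prompted" activations: a token-dependent shift along v.
w_true, b_true = rng.normal(size=d_model), 0.5
h_prompted = token_specific_steering(h_plain, v, w_true, b_true)

w_fit, b_fit = fit_coefficient_model(h_plain, h_prompted, v)
h_steered = token_specific_steering(h_plain, v, w_fit, b_fit)
```

In this toy setup the fitted model reproduces the synthetic prompt shift, whereas `uniform_steering` by construction applies the identical offset `alpha * v` at every position and so cannot leave some tokens nearly untouched while moving others strongly.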
