HyperSteer: Activation Steering at Scale with Hypernetworks
Jiuding Sun, Sidharth Baskaran, Zhengxuan Wu, Michael Sklar, Christopher Potts, Atticus Geiger

TL;DR
HyperSteer introduces a hypernetwork-based approach to generate activation steering vectors for language models, enabling scalable, effective control over text generation with strong performance even on unseen prompts.
Contribution
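The paper presents HyperSteer, a family of hypernetwork-based architectures trained end-to-end to generate activation steering vectors conditioned on natural language steering prompts and the internals of the steered LM. Unlike supervised methods that need separate data collection and training for each new steering vector, one trained hypernetwork produces vectors for many prompts.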
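To make the idea concrete, here is a minimal sketch of a hypernetwork that maps a steering-prompt embedding plus a summary of the base model's activations to a steering vector, which is then added to every token's hidden state. Everything in it is an assumption for illustration: the two-layer MLP, the mean-pooled activation summary, the dimensions, the random untrained weights, and the names `hypernetwork`, `steer`, and `alpha` are hypothetical and are not taken from the paper, whose architecture and training procedure are not described in this summary.

```python
import math
import random


def make_linear(n_in, n_out, rng):
    """Random weight matrix (n_out x n_in); a stand-in for trained parameters."""
    scale = 1.0 / math.sqrt(n_in)
    return [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def hypernetwork(prompt_emb, act_summary, W1, W2):
    """Toy hypernetwork (assumed MLP): condition on the steering prompt and
    on a summary of the steered LM's activations, output a steering vector."""
    z = prompt_emb + act_summary  # list concatenation of the two inputs
    hidden = [math.tanh(v) for v in matvec(W1, z)]
    return matvec(W2, hidden)


def steer(activations, steering_vec, alpha=1.0):
    """Add the scaled steering vector to each token's hidden state."""
    return [[h + alpha * s for h, s in zip(tok, steering_vec)] for tok in activations]


rng = random.Random(0)
d_prompt, d_model, d_hidden = 4, 6, 8  # illustrative sizes only
W1 = make_linear(d_prompt + d_model, d_hidden, rng)
W2 = make_linear(d_hidden, d_model, rng)

# Hypothetical inputs: a prompt embedding and hidden states for 3 tokens.
prompt_emb = [rng.gauss(0, 1) for _ in range(d_prompt)]
activations = [[rng.gauss(0, 1) for _ in range(d_model)] for _ in range(3)]
act_summary = [sum(col) / len(activations) for col in zip(*activations)]

steering_vec = hypernetwork(prompt_emb, act_summary, W1, W2)
steered = steer(activations, steering_vec, alpha=2.0)
```

In a real setting the weights would be learned end-to-end against a steering objective and the vector would be injected into a chosen layer of the language model; the sketch only shows the data flow from prompt and internals to a steered hidden state.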
Findings
When trained on thousands of steering prompts, HyperSteer outperforms state-of-the-art activation steering methods.
HyperSteer generalizes to steering prompts not seen during training.
HyperSteer matches the performance of steering-via-prompting.
Abstract
Steering language models (LMs) by modifying internal activations is a popular approach for controlling text generation. Unsupervised dictionary learning methods, e.g., sparse autoencoders, can be scaled to produce many steering vectors, but lack guarantees on the individual efficacy of each vector and control over the coverage of relevant steering tasks. In contrast, supervised methods for constructing steering vectors are targeted and effective, but require more data collection and training for each additional steering vector produced. In this work, we introduce HyperSteer, a family of hypernetwork-based architectures which are trained end-to-end to generate steering vectors conditioned on the natural language steering prompts and the internals of the steered LM. In our evaluations, we show that scaling HyperSteer with thousands of steering prompts exceeds the performance of…
Taxonomy
Topics: Topic Modeling · Multimodal Machine Learning Applications · Natural Language Processing Techniques
