Improving Instruction-Following in Language Models through Activation Steering
Alessandro Stolfo, Vidhisha Balachandran, Safoora Yousefi, Eric Horvitz, Besmira Nushi

TL;DR
This paper derives instruction-specific activation vectors and uses them to steer language models at inference time, improving adherence to constraints such as output format, length, and word inclusion.
Contribution
The paper presents a novel activation steering technique that enables modular, inference-time control of language models based on instruction-specific activation vectors, including compositional and transfer capabilities.
Findings
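Steering with instruction-specific activation vectors improves adherence to constraints when instructions are present.
Steering can guide models to follow constraints even without explicit instructions in the input.
Transferring the steering vectors improves base model performance.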
Abstract
The ability to follow instructions is crucial for numerous real-world applications of language models. In pursuit of deeper insights and more powerful capabilities, we derive instruction-specific vector representations from language models and use them to steer models accordingly. These vectors are computed as the difference in activations between inputs with and without instructions, enabling a modular approach to activation steering. We demonstrate how this method can enhance model adherence to constraints such as output format, length, and word inclusion, providing inference-time control over instruction following. Our experiments across four models demonstrate how we can use the activation vectors to guide models to follow constraints even without explicit instructions and to enhance performance when instructions are present. Additionally, we explore the compositionality of…
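The difference-of-activations idea described in the abstract can be sketched in a few lines. This is a minimal illustration on toy arrays, not the authors' implementation: the function names `steering_vector` and `steer`, the averaging over paired examples, and the scaling coefficient `alpha` are all assumptions introduced here, and the random arrays stand in for activations that would in practice be read from a model layer.

```python
import numpy as np

def steering_vector(acts_with, acts_without):
    """Mean activation difference between inputs with and without an instruction.

    Both arguments have shape (n_examples, hidden_dim) and are paired row by row.
    Averaging over examples is an assumption of this sketch.
    """
    acts_with = np.asarray(acts_with, dtype=float)
    acts_without = np.asarray(acts_without, dtype=float)
    if acts_with.shape != acts_without.shape:
        raise ValueError("paired activations must have the same shape")
    return (acts_with - acts_without).mean(axis=0)

def steer(hidden, vector, alpha=1.0):
    """Add the scaled steering vector to each hidden state (alpha is hypothetical)."""
    return np.asarray(hidden, dtype=float) + alpha * vector

# Toy data standing in for real model activations (hypothetical values).
rng = np.random.default_rng(0)
base = rng.normal(size=(8, 4))                 # activations without the instruction
direction = np.array([1.0, 0.0, -2.0, 0.5])    # effect the instruction has in this toy setup
with_instr = base + direction                  # activations with the instruction

v = steering_vector(with_instr, base)
steered = steer(base, v, alpha=1.0)
```

In this toy setup the instruction shifts every activation by the same offset, so the recovered vector equals that offset and steering the instruction-free activations reproduces the instructed ones. Real activations would not be this clean, and the layer and scaling would need to be chosen empirically.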
Taxonomy
Topics: Natural Language Processing Techniques · Intelligent Tutoring Systems and Adaptive Learning · Speech and dialogue systems
Methods: Balanced Selection
