Improving Instruction-Following in Language Models through Activation Steering

Alessandro Stolfo; Vidhisha Balachandran; Safoora Yousefi; Eric Horvitz; Besmira Nushi

arXiv:2410.12877·cs.CL·April 15, 2025

TL;DR

This paper uses instruction-specific activation vectors to steer language models at inference time, improving their adherence to constraints and instructions. The vectors can be composed and transferred between models.

Contribution

The paper presents a novel activation steering technique that enables modular, inference-time control of language models based on instruction-specific activation vectors, including compositional and transfer capabilities.

Findings

01

Activation vectors improve model adherence to constraints.

02

Steering enables control without explicit instructions.

03

Steering vectors computed on instruction-tuned models transfer to improve base models.

Abstract

The ability to follow instructions is crucial for numerous real-world applications of language models. In pursuit of deeper insights and more powerful capabilities, we derive instruction-specific vector representations from language models and use them to steer models accordingly. These vectors are computed as the difference in activations between inputs with and without instructions, enabling a modular approach to activation steering. We demonstrate how this method can enhance model adherence to constraints such as output format, length, and word inclusion, providing inference-time control over instruction following. Our experiments across four models demonstrate how we can use the activation vectors to guide models to follow constraints even without explicit instructions and to enhance performance when instructions are present. Additionally, we explore the compositionality of…
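The vector computation described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the activations here are synthetic NumPy arrays rather than real model hidden states, and the function names (`compute_steering_vector`, `steer`), the averaging over prompt pairs, the unit-length normalization, and the `scale` parameter are all assumptions made for illustration.

```python
import numpy as np


def compute_steering_vector(acts_with, acts_without):
    """Mean difference between activations for paired inputs
    with and without the instruction, scaled to unit length.
    Averaging and normalization are assumptions of this sketch."""
    diff = acts_with - acts_without      # shape: (n_pairs, d_model)
    vector = diff.mean(axis=0)           # shape: (d_model,)
    return vector / np.linalg.norm(vector)


def steer(hidden, vector, scale=1.0):
    """Add the scaled steering vector to a hidden state at inference time."""
    return hidden + scale * vector


# Synthetic stand-in for model activations (hypothetical data).
rng = np.random.default_rng(0)
d_model, n_pairs = 16, 32

# Pretend the instruction shifts activations along dimension 3.
instruction_direction = np.zeros(d_model)
instruction_direction[3] = 1.0

acts_without = rng.normal(size=(n_pairs, d_model))
acts_with = (
    acts_without
    + 2.0 * instruction_direction
    + 0.1 * rng.normal(size=(n_pairs, d_model))
)

vector = compute_steering_vector(acts_with, acts_without)

# Steer a new hidden state that was produced without the instruction.
hidden = rng.normal(size=d_model)
steered = steer(hidden, vector, scale=4.0)
```

With a real model, the same addition would typically be applied to the hidden states of a chosen layer during the forward pass; which layer and what scale to use are tuning choices that this sketch does not address.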

Taxonomy

Topics: Natural Language Processing Techniques · Intelligent Tutoring Systems and Adaptive Learning · Speech and dialogue systems

Methods: Balanced Selection