Understanding (Un)Reliability of Steering Vectors in Language Models

Joschka Braun; Carsten Eickhoff; David Krueger; Seyed Ali Bahrainian; Dmitrii Krasheninnikov

arXiv:2505.22637 · cs.LG · May 29, 2025

Open Access

TL;DR

This paper investigates why steering vectors in language models are unreliable. Every prompt type tested yields a net positive steering effect but with high variance across samples, and steering is less reliable when the target behavior is not represented by a coherent direction in activation space.

Contribution

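The study analyzes two factors that affect steering vector reliability: the prompt type used to extract the vector, and the geometry of the activation differences, in particular their directional agreement (cosine similarity) and how well positive and negative activations are separated.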

Findings

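01. All seven prompt types produce a net positive steering effect, but with high variance across samples; steering often has the opposite of the desired effect.

02. Higher cosine similarity between training-set activation differences predicts more effective steering.

03. Datasets in which positive and negative activations are better separated are more steerable.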
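To make the third finding concrete, here is a minimal sketch of one possible way to quantify how well positive and negative activations separate. It is an illustration, not the paper's evaluation code: the function name `separation_along_mean_difference`, the use of a discriminability-style score (difference of projected means divided by the pooled standard deviation), and the synthetic Gaussian activations are all assumptions made for this example.

```python
import numpy as np

def separation_along_mean_difference(pos_acts, neg_acts):
    """Separation score of two activation sets, each of shape (n, d).

    Hypothetical metric for illustration: project both sets onto the unit
    difference-of-means direction, then divide the gap between the projected
    means by the pooled standard deviation of the projections.
    """
    direction = pos_acts.mean(axis=0) - neg_acts.mean(axis=0)
    direction = direction / np.linalg.norm(direction)
    pos_proj = pos_acts @ direction
    neg_proj = neg_acts @ direction
    pooled_std = np.sqrt(0.5 * (pos_proj.var() + neg_proj.var()))
    return float((pos_proj.mean() - neg_proj.mean()) / pooled_std)

# Synthetic stand-ins for model activations (assumption: unit Gaussian noise).
rng = np.random.default_rng(1)
n, d = 300, 32
shift = np.zeros(d)
shift[0] = 1.0

neg = rng.normal(size=(n, d))
well_separated = rng.normal(size=(n, d)) + 6.0 * shift  # large shift along one axis
overlapping = rng.normal(size=(n, d)) + 0.5 * shift     # small shift, heavy overlap

sep_high = separation_along_mean_difference(well_separated, neg)
sep_low = separation_along_mean_difference(overlapping, neg)
print(f"well separated: {sep_high:.2f}, overlapping: {sep_low:.2f}")
```

Under this toy metric, the dataset with the larger shift scores much higher, which is the sense of "better separated" used in this sketch.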

Abstract

Steering vectors are a lightweight method to control language model behavior by adding a learned bias to the activations at inference time. Although steering demonstrates promising performance, recent work shows that it can be unreliable or even counterproductive in some cases. This paper studies the influence of prompt types and the geometry of activation differences on steering reliability. First, we find that all seven prompt types used in our experiments produce a net positive steering effect, but exhibit high variance across samples, and often give an effect opposite of the desired one. No prompt type clearly outperforms the others, and yet the steering vectors resulting from the different prompt types often differ directionally (as measured by cosine similarity). Second, we show that higher cosine similarity between training set activation differences predicts more effective…
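The quantities the abstract refers to can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes a difference-of-means steering vector computed from paired positive and negative activations, uses synthetic NumPy arrays in place of real model activations, and the names `steering_vector`, `mean_cosine_to_vector`, and `apply_steering` are hypothetical.

```python
import numpy as np

def steering_vector(pos_acts, neg_acts):
    """Mean of per-pair activation differences; inputs have shape (n, d)."""
    return (pos_acts - neg_acts).mean(axis=0)

def mean_cosine_to_vector(pos_acts, neg_acts, vector):
    """Average cosine similarity between each per-pair difference and `vector`."""
    diffs = pos_acts - neg_acts
    norms = np.linalg.norm(diffs, axis=1) * np.linalg.norm(vector)
    return float(np.mean(diffs @ vector / norms))

def apply_steering(acts, vector, scale=1.0):
    """Add the scaled steering vector to every activation row (broadcast)."""
    return acts + scale * vector

# Synthetic stand-ins for model activations (assumption for illustration).
rng = np.random.default_rng(0)
n, d = 200, 64
direction = np.zeros(d)
direction[0] = 1.0
neg = rng.normal(size=(n, d))

# Coherent behavior: every pair differs along roughly the same direction.
pos_coherent = neg + 4.0 * direction + 0.1 * rng.normal(size=(n, d))
# Incoherent behavior: pair differences point in unrelated directions.
pos_incoherent = neg + rng.normal(size=(n, d))

v_coh = steering_vector(pos_coherent, neg)
v_inc = steering_vector(pos_incoherent, neg)
cos_coh = mean_cosine_to_vector(pos_coherent, neg, v_coh)
cos_inc = mean_cosine_to_vector(pos_incoherent, neg, v_inc)
print(f"coherent: {cos_coh:.2f}, incoherent: {cos_inc:.2f}")

# Steering at inference time amounts to shifting activations by the vector.
steered = apply_steering(neg, v_coh, scale=2.0)
```

In this toy setup the coherent dataset yields per-pair differences that align closely with the steering vector, while the incoherent one does not; the abstract reports that higher agreement of this kind predicts more effective steering.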

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Language and cultural evolution