Predicting Where Steering Vectors Succeed

Jayadev Billa

arXiv:2604.15557 · cs.LG · April 20, 2026

TL;DR

The paper introduces the Linear Accessibility Profile (LAP), a per-layer diagnostic that predicts, without any training, where steering vectors will succeed in a language model. It rests on a new measure, $A_{lin}$, obtained by applying the model's unembedding matrix to intermediate hidden states.

Contribution

The paper proposes LAP and $A_{lin}$ as training-free predictors of steering vector success, evaluated on 24 controlled binary concept families across five models (Pythia-2.8B to Llama-8B).

Findings

01

Peak $A_{lin}$ predicts steering effectiveness with high correlation ($\rho = +0.86$ to $+0.91$).

02

Layer selection based on LAP improves steering success compared to standard heuristics.

03

A three-regime framework explains when difference-of-means steering works, when nonlinear methods are needed, and when no method can work.
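The correlations quoted in the findings above are written with $\rho$. The sketch below assumes this denotes Spearman rank correlation, which this summary does not state explicitly. The function names and the toy numbers are hypothetical and are not results from the paper; the block only illustrates how such a score could be computed from per-concept measurements.

```python
def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(xs, ys):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Made-up values for illustration only: one entry per concept family.
peak_accessibility = [0.52, 0.61, 0.70, 0.83, 0.95]
steering_success = [0.10, 0.05, 0.40, 0.55, 0.90]
rho = spearman_rho(peak_accessibility, steering_success)
```

A rank-based measure is a natural fit here because it only asks whether concepts with higher peak $A_{lin}$ also steer better, without assuming the relationship is linear.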

Abstract

Steering vectors work for some concepts and layers but fail for others, and practitioners have no way to predict which setting applies before running an intervention. We introduce the Linear Accessibility Profile (LAP), a per-layer diagnostic that repurposes the logit lens as a predictor of steering vector effectiveness. The key measure, $A_{lin}$, applies the model's unembedding matrix to intermediate hidden states, requiring no training. Across 24 controlled binary concept families on five models (Pythia-2.8B to Llama-8B), peak $A_{lin}$ predicts steering effectiveness at $\rho = +0.86$ to $+0.91$ and layer selection at $\rho = +0.63$ to $+0.92$. A three-regime framework explains when difference-of-means steering works, when nonlinear methods are needed, and when no method can work. An entity-steering demo confirms the prediction end-to-end: steering at the…
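As a rough illustration of the idea in the abstract, the sketch below projects hidden states from each layer through an unembedding matrix (the logit lens), scores each layer, and builds a difference-of-means steering vector at the best-scoring layer. The exact definition of $A_{lin}$ is not given in this summary, so the score used here (the fraction of examples whose logit-lens margin has the correct sign) is a hypothetical stand-in. All function names, the toy unembedding matrix, and the toy hidden states are assumptions made for illustration, not the paper's method or data.

```python
def logit_lens_margin(hidden, unembed, pos_token, neg_token):
    """Project a hidden state through the unembedding matrix (vocab x dim)
    and return logit(pos_token) - logit(neg_token)."""
    def logit(token):
        return sum(w * h for w, h in zip(unembed[token], hidden))
    return logit(pos_token) - logit(neg_token)


def layer_accessibility(pos_states, neg_states, unembed, pos_token, neg_token):
    """Hypothetical stand-in score: fraction of examples whose margin has
    the sign matching their concept label."""
    correct = sum(
        logit_lens_margin(h, unembed, pos_token, neg_token) > 0 for h in pos_states
    )
    correct += sum(
        logit_lens_margin(h, unembed, pos_token, neg_token) < 0 for h in neg_states
    )
    return correct / (len(pos_states) + len(neg_states))


def difference_of_means(pos_states, neg_states):
    """Steering vector: mean positive hidden state minus mean negative one."""
    dim = len(pos_states[0])

    def mean(states):
        return [sum(s[d] for s in states) / len(states) for d in range(dim)]

    return [p - n for p, n in zip(mean(pos_states), mean(neg_states))]


# Toy setup (assumed): 2 layers, hidden size 2, vocabulary of 2 tokens.
unembed = [[1.0, 0.0], [0.0, 1.0]]  # row 0: positive token, row 1: negative token
layers = [
    {"pos": [[0.1, 0.2], [0.3, 0.1]], "neg": [[0.2, 0.1], [0.1, 0.3]]},
    {"pos": [[1.0, 0.0], [0.8, 0.2]], "neg": [[0.0, 1.0], [0.2, 0.8]]},
]

# Per-layer profile, then pick the layer with the highest score.
profile = [layer_accessibility(l["pos"], l["neg"], unembed, 0, 1) for l in layers]
peak_layer = max(range(len(profile)), key=lambda i: profile[i])
steering_vector = difference_of_means(layers[peak_layer]["pos"], layers[peak_layer]["neg"])
```

In this toy example the concept is only linearly readable at the second layer, so that layer is chosen for the intervention. With a real model, the hidden states and unembedding matrix would come from the model itself rather than from hand-written lists.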
