
TL;DR
The paper introduces the Linear Accessibility Profile (LAP), a diagnostic tool that predicts the success of steering vectors in language models without training, based on a new measure called $A_{lin}$.
Contribution
It proposes LAP and $A_{lin}$ as effective, training-free predictors of steering vector success across multiple models and concepts.
Findings
Peak $A_{lin}$ correlates strongly with steering effectiveness ($\rho$ = +0.86 to +0.91).
Layer selection based on LAP improves steering success compared to standard heuristics.
A three-regime framework explains when different steering methods are effective.
Abstract
Steering vectors work for some concepts and layers but fail for others, and practitioners have no way to predict which setting applies before running an intervention. We introduce the Linear Accessibility Profile (LAP), a per-layer diagnostic that repurposes the logit lens as a predictor of steering vector effectiveness. The key measure, $A_{lin}$, applies the model's unembedding matrix to intermediate hidden states, requiring no training. Across 24 controlled binary concept families on five models (Pythia-2.8B to Llama-8B), peak $A_{lin}$ predicts steering effectiveness at $\rho$ = +0.86 to +0.91 and is also predictive of layer selection. A three-regime framework explains when difference-of-means steering works, when nonlinear methods are needed, and when no method can work. An entity-steering demo confirms the prediction end-to-end: steering at the…
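To make the logit-lens idea concrete, here is a minimal sketch on synthetic data. The abstract does not give the exact definition of $A_{lin}$, so the per-layer score below (accuracy of reading a binary concept from the logit gap between two tokens) is an assumed stand-in, not the paper's measure. The function names `logit_lens_profile` and `diff_of_means`, the token indices, and the random "hidden states" are all hypothetical and chosen only for illustration.

```python
import numpy as np

def logit_lens_profile(hidden, unembed, pos_token, neg_token, labels):
    """Per-layer score from a logit-lens readout (assumed proxy, not the paper's A_lin).

    hidden:  (n_layers, n_examples, d_model) intermediate hidden states
    unembed: (d_model, vocab) unembedding matrix
    labels:  (n_examples,) binary concept labels in {0, 1}
    """
    logits = hidden @ unembed  # (n_layers, n_examples, vocab)
    margin = logits[..., pos_token] - logits[..., neg_token]
    pred = (margin > 0).astype(int)
    # Fraction of examples whose concept is read out correctly, per layer.
    return (pred == labels[None, :]).mean(axis=1)

def diff_of_means(layer_hidden, labels):
    """Difference-of-means steering vector for one layer: mean(pos) - mean(neg)."""
    return layer_hidden[labels == 1].mean(axis=0) - layer_hidden[labels == 0].mean(axis=0)

# Synthetic stand-in for a model: the concept signal grows with depth.
rng = np.random.default_rng(0)
n_layers, n, d, vocab = 4, 40, 8, 10
unembed = rng.normal(size=(d, vocab))
labels = np.arange(n) % 2
direction = unembed[:, 3] - unembed[:, 7]  # hypothetical concept tokens 3 and 7
signs = np.where(labels == 1, 1.0, -1.0)[:, None]
strength = np.linspace(0.0, 1.0, n_layers)
hidden = np.stack([rng.normal(size=(n, d)) + s * signs * direction for s in strength])

profile = logit_lens_profile(hidden, unembed, 3, 7, labels)
best_layer = int(profile.argmax())  # layer chosen by the profile, before any steering
steer = diff_of_means(hidden[best_layer], labels)
print("profile:", profile, "best layer:", best_layer, "steering vector shape:", steer.shape)
```

With a real model, `hidden` would come from the residual stream at each layer and `unembed` from the model's output projection. The point of the sketch is only that the profile needs matrix products and no trained probe.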
