On the Non-Identifiability of Steering Vectors in Large Language Models
Sohan Venkatesh, Ashish Mahendran Kurapath

TL;DR
This paper demonstrates that steering vectors in large language models are fundamentally non-identifiable, challenging assumptions about interpretability and the uniqueness of internal representations.
Contribution
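It provides theoretical and empirical evidence that steering directions are non-identifiable: large equivalence classes of behaviorally indistinguishable interventions exist, which undermines interpretability methods that treat a steering vector as a unique internal representation.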
Findings
Orthogonal perturbations achieve near-equivalent steering efficacy, with negligible effect sizes, across multiple models and traits.
Null-space dimensionality is estimated via SVD of activation covariance matrices.
Non-identifiability persists across diverse prompt distributions.
Abstract
Activation steering methods are widely used to control large language model (LLM) behavior and are often interpreted as revealing meaningful internal representations. This interpretation assumes that steering directions are identifiable and uniquely recoverable from input-output behavior. We show that, under white-box single-layer access, steering vectors are fundamentally non-identifiable due to large equivalence classes of behaviorally indistinguishable interventions. Empirically, we find that orthogonal perturbations achieve near-equivalent efficacy with negligible effect sizes across multiple models and traits, with pre-trained semantic classifiers confirming equivalence at the output level. We estimate null-space dimensionality via SVD of activation covariance matrices and validate that equivalence holds robustly throughout the operationally relevant steering range. Critically, we…
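As a rough illustration of the null-space estimate mentioned in the abstract, the sketch below computes the SVD of an activation covariance matrix, counts the directions with negligible variance, and builds an alternative steering vector that differs from the original only along those directions. Everything here is an assumption for illustration rather than the authors' code: the function names, the relative tolerance `rel_tol`, and the synthetic low-rank activations that stand in for real single-layer LLM activations.

```python
import numpy as np


def estimate_null_space(activations, rel_tol=1e-6):
    """Estimate the null space of the activation covariance via SVD.

    activations: array of shape (n_samples, d_model).
    rel_tol: hypothetical cutoff; singular values below rel_tol * largest
             are treated as zero.
    Returns (null_dim, null_basis), where the rows of null_basis are
    orthonormal vectors spanning the approximate null space.
    """
    centered = activations - activations.mean(axis=0, keepdims=True)
    cov = centered.T @ centered / (activations.shape[0] - 1)
    # For a symmetric PSD matrix the singular values equal the eigenvalues,
    # returned in descending order.
    _, s, vt = np.linalg.svd(cov)
    rank = int(np.sum(s > rel_tol * s[0]))
    null_basis = vt[rank:]
    return cov.shape[0] - rank, null_basis


def null_space_perturbation(v, null_basis, rng):
    """Draw a unit vector in the null space that is also orthogonal to v."""
    coeffs = rng.standard_normal(null_basis.shape[0])
    delta = coeffs @ null_basis
    # Remove only the part of delta along v's null-space component, so
    # delta stays inside the null space and ends up orthogonal to v.
    v_null = null_basis.T @ (null_basis @ v)
    denom = v_null @ v_null
    if denom > 1e-12:
        delta = delta - (delta @ v_null) / denom * v_null
    return delta / np.linalg.norm(delta)


# Synthetic stand-in for layer activations: 500 samples in a 64-dimensional
# space that actually occupy only a 16-dimensional subspace.
rng = np.random.default_rng(0)
n_samples, d_model, true_rank = 500, 64, 16
acts = rng.standard_normal((n_samples, true_rank)) @ rng.standard_normal((true_rank, d_model))

null_dim, null_basis = estimate_null_space(acts)
v = rng.standard_normal(d_model)       # stand-in steering vector
delta = null_space_perturbation(v, null_basis, rng)
v_alt = v + 0.5 * delta                # a different vector, same projection onto the data subspace

print("estimated null-space dimension:", null_dim)
print("delta . v:", float(delta @ v))
```

In this toy setting `v` and `v_alt` project identically onto the subspace the activations occupy, which is one simple way an equivalence class of steering vectors can arise. Whether a given pair is behaviorally indistinguishable in a real model depends on the downstream layers and has to be checked empirically.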
