On the Non-Identifiability of Steering Vectors in Large Language Models

Sohan Venkatesh; Ashish Mahendran Kurapath

arXiv:2602.06801·cs.LG·April 2, 2026

TL;DR

This paper argues that steering vectors in large language models are fundamentally non-identifiable under white-box single-layer access, which challenges the assumption that a steering direction uniquely reveals an internal representation.

Contribution

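The paper provides theoretical and empirical evidence that steering directions are non-identifiable: large equivalence classes of behaviorally indistinguishable interventions exist, which weakens interpretability readings that treat a steering direction as unique.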

Findings

01

Orthogonal perturbations achieve near-equivalent steering efficacy, with negligible effect sizes, across multiple models and traits.

02

Null-space dimensionality is estimated via SVD of activation covariance matrices.

03

Non-identifiability persists across diverse prompt distributions.

Abstract

Activation steering methods are widely used to control large language model (LLM) behavior and are often interpreted as revealing meaningful internal representations. This interpretation assumes that steering directions are identifiable and uniquely recoverable from input-output behavior. We show that, under white-box single-layer access, steering vectors are fundamentally non-identifiable due to large equivalence classes of behaviorally indistinguishable interventions. Empirically, we find that orthogonal perturbations achieve near-equivalent efficacy with negligible effect sizes across multiple models and traits, with pre-trained semantic classifiers confirming equivalence at the output level. We estimate null-space dimensionality via SVD of activation covariance matrices and validate that equivalence holds robustly throughout the operationally relevant steering range. Critically, we…
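
The abstract mentions estimating null-space dimensionality via SVD of activation covariance matrices. A minimal sketch of that idea follows, using NumPy on synthetic data. The function name `null_space_dim`, the relative tolerance `rel_tol`, and the synthetic low-rank activations are all assumptions made for illustration; the authors' actual thresholding rule and data are not given in this summary.

```python
import numpy as np


def null_space_dim(activations: np.ndarray, rel_tol: float = 1e-6) -> int:
    """Count near-null directions of the activation covariance.

    `activations` has shape (n_samples, hidden_dim). A direction counts as
    (near-)null when its singular value falls below rel_tol times the
    largest one. The tolerance is a hypothetical choice, not the paper's.
    """
    centered = activations - activations.mean(axis=0, keepdims=True)
    cov = centered.T @ centered / (activations.shape[0] - 1)
    # Singular values are returned in descending order.
    singular_values = np.linalg.svd(cov, compute_uv=False)
    effective_rank = int(np.sum(singular_values > rel_tol * singular_values[0]))
    return activations.shape[1] - effective_rank


# Synthetic stand-in for real activations: 200 samples in a 16-dimensional
# space that actually lie in a 5-dimensional subspace.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 5))
basis = rng.normal(size=(5, 16))
acts = latent @ basis

dim = null_space_dim(acts)
print(f"estimated null-space dimensionality: {dim}")
```

In this toy setting, any perturbation added along one of the near-null directions leaves the projection onto the occupied subspace unchanged, which gives an intuition for how many distinct vectors could fall in one equivalence class. Real activations rarely have exactly zero singular values, so the result depends strongly on the chosen tolerance.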
