Steered LLM Activations are Non-Surjective

Aayush Mishra; Daniel Khashabi; Anqi Liu

arXiv:2604.09839·cs.AI·May 11, 2026

TL;DR

This paper demonstrates that activation steering in large language models cannot generally be replicated by any prompt, highlighting a fundamental difference between white-box control and black-box prompting.

Contribution

It provides a formal proof that activation steering pushes model states off the prompt-reachable manifold, establishing a separation between steerability and prompt-based interpretability.

Findings

01 Under practical assumptions, activation steering almost surely produces states that no prompt can reach.

02 Empirical evidence across three LLMs supports the theoretical result.

03 Steerability and prompt-based control are fundamentally different.

Abstract

Activation steering is a popular white-box control technique that modifies model activations to elicit an abstract change in its behavior. It has also become a standard tool in interpretability (e.g., probing truthfulness, or translating activations into human-readable explanations) and safety research (e.g., jailbreakability). However, it is unclear whether steered behavior is realizable by any textual prompt. In this work, we cast this question as a surjectivity problem: for a fixed model, does every steered activation admit a preimage under the model's natural forward pass? Under practical assumptions, we prove that activation steering pushes the residual stream off the manifold of states reachable from discrete prompts. Almost surely, no prompt can reproduce the same internal behavior induced by steering. We also illustrate this finding empirically across three widely used LLMs. Our…
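
The surjectivity question can be made concrete with a toy brute-force check. The sketch below is only an illustration and is not the authors' method or code: `toy_forward`, the random weights, the vocabulary size, and the steering strength `alpha` are all hypothetical stand-ins. It enumerates every prompt up to a small length, collects the finite set of activations those prompts produce, and measures how far a steered activation lies from the nearest one.

```python
import itertools
import math
import random


def toy_forward(prompt, embed, weights):
    """Hypothetical stand-in for a forward pass: token ids -> one activation vector."""
    dim = len(weights)
    # Sum the token embeddings, then apply a linear map followed by tanh.
    pooled = [sum(embed[tok][i] for tok in prompt) for i in range(dim)]
    return [math.tanh(sum(w * x for w, x in zip(row, pooled))) for row in weights]


def reachable_states(vocab_size, max_len, embed, weights):
    """Activations of every prompt with 1..max_len tokens (a finite set)."""
    states = []
    for length in range(1, max_len + 1):
        for prompt in itertools.product(range(vocab_size), repeat=length):
            states.append(toy_forward(prompt, embed, weights))
    return states


def steer(state, direction, alpha):
    """Additive steering: shift the activation by alpha times a direction."""
    return [s + alpha * d for s, d in zip(state, direction)]


def min_distance(target, states):
    """Euclidean distance from target to the closest reachable state."""
    return min(math.dist(target, s) for s in states)


rng = random.Random(0)
DIM, VOCAB, MAX_LEN = 6, 4, 3  # toy sizes, chosen only to keep enumeration cheap
embed = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(VOCAB)]
weights = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(DIM)]

states = reachable_states(VOCAB, MAX_LEN, embed, weights)
base = toy_forward((0, 1), embed, weights)          # state of a real prompt
direction = [rng.gauss(0, 1) for _ in range(DIM)]   # hypothetical steering vector
steered = steer(base, direction, alpha=2.0)

gap = min_distance(steered, states)
print(f"{len(states)} reachable states; steered state is {gap:.4f} from the nearest")
```

The unsteered `base` state sits at distance zero from the enumerated set because it comes from a real prompt, while the steered state has no exact match among the enumerated prompts. A toy like this only shows the shape of the question; the paper's result concerns real LLM residual streams and rests on assumptions that this excerpt does not spell out.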

Code & Models

Repositories

aamixsh/invertsteer (GitHub)
