Steering Awareness: Detecting Activation Steering from Within

Joshua Fonseca Rivera; David Demitri Africa

arXiv:2511.21399 · cs.CL · March 20, 2026

Open Access

TL;DR

This paper tests whether language models can detect, during their own forward pass, that their internal activations have been manipulated with a steering vector. After fine-tuning, instruction-tuned models detect such interventions and identify the injected concept with high accuracy.

Contribution

The study shows that instruction-tuned models can be fine-tuned to develop strong steering awareness: they detect injected steering vectors and identify the concepts those vectors encode with high accuracy. This challenges the assumption, common in safety evaluations, that a model cannot detect the intervention.

Findings

01

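The best fine-tuned model reaches 95.5% detection and 71.2% concept identification on held-out concepts, with zero false positives on clean inputs.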

02

Detection generalizes to unseen steering vector construction methods only when their directions have high cosine similarity to the training distribution.

03

Detection does not confer resistance; detection-trained models are more susceptible to steering, not less.

Abstract

Activation steering -- adding a vector to a model's residual stream to modify its behavior -- is widely used in safety evaluations as if the model cannot detect the intervention. We test this assumption, introducing steering awareness: a model's ability to infer, during its own forward pass, that a steering vector was injected and what concept it encodes. After fine-tuning, seven instruction-tuned models develop strong steering awareness on held-out concepts; the best reaches 95.5% detection, 71.2% concept identification, and zero false positives on clean inputs. This generalizes to unseen steering vector construction methods when their directions have high cosine similarity to the training distribution but not otherwise, indicating a geometric detector rather than a generic anomaly detector. Surprisingly, detection does not confer resistance; on both factual and safety benchmarks,…
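
The two operations the abstract names, adding a steering vector to a residual-stream activation and comparing steering directions by cosine similarity, can be sketched in a few lines. This is a toy illustration under stated assumptions, not the paper's method: the detector in the paper is learned through fine-tuning, and every name here (`add_steering_vector`, `cosine_similarity`, the scale `alpha`, the 16-dimensional random vectors) is hypothetical.

```python
import math
import random


def add_steering_vector(residual, steering, alpha=1.0):
    """Return residual + alpha * steering (elementwise); alpha is an assumed scale."""
    if len(residual) != len(steering):
        raise ValueError("residual and steering must have the same dimension")
    return [r + alpha * s for r, s in zip(residual, steering)]


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Toy data: random vectors stand in for real activations and steering vectors.
rng = random.Random(0)
dim = 16
residual = [rng.gauss(0, 1) for _ in range(dim)]
train_vec = [rng.gauss(0, 1) for _ in range(dim)]

# Steering: shift the activation along the steering direction.
steered = add_steering_vector(residual, train_vec, alpha=4.0)

# An "unseen" vector close to the training direction (small added noise).
near = [t + 0.05 * rng.gauss(0, 1) for t in train_vec]

# An "unseen" vector made orthogonal to the training direction
# by subtracting its projection onto train_vec.
rand = [rng.gauss(0, 1) for _ in range(dim)]
proj = sum(a * b for a, b in zip(rand, train_vec)) / sum(a * a for a in train_vec)
far = [r - proj * t for r, t in zip(rand, train_vec)]

print("similarity of near vector:", cosine_similarity(near, train_vec))
print("similarity of far vector: ", cosine_similarity(far, train_vec))
```

In this sketch, `near` plays the role of a vector from an unseen construction method whose direction resembles the training distribution, and `far` one whose direction does not; the abstract reports that detection transfers in the first case but not the second.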


Taxonomy

Topics: Neurobiology of Language and Bilingualism · Topic Modeling · Multimodal Machine Learning Applications