Steering Awareness: Detecting Activation Steering from Within
Joshua Fonseca Rivera, David Demitri Africa

TL;DR
This paper investigates whether language models can detect when their internal activations have been manipulated with steering vectors. It finds that fine-tuned models develop strong steering awareness and can identify such interventions during their own forward pass.
Contribution
The study shows that instruction-tuned models can be fine-tuned to detect injected steering vectors and identify the concepts they encode with high accuracy, challenging the assumption in safety evaluations that models cannot detect the intervention.
Findings
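The best model reaches 95.5% detection accuracy and 71.2% concept identification on held-out concepts, with zero false positives on clean inputs.
Detection generalizes to unseen steering vector construction methods only when their directions have high cosine similarity to the training distribution.
Detection does not confer resistance; detection-trained models are more susceptible.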
Abstract
Activation steering -- adding a vector to a model's residual stream to modify its behavior -- is widely used in safety evaluations as if the model cannot detect the intervention. We test this assumption, introducing steering awareness: a model's ability to infer, during its own forward pass, that a steering vector was injected and what concept it encodes. After fine-tuning, seven instruction-tuned models develop strong steering awareness on held-out concepts; the best reaches 95.5% detection, 71.2% concept identification, and zero false positives on clean inputs. This generalizes to unseen steering vector construction methods when their directions have high cosine similarity to the training distribution but not otherwise, indicating a geometric detector rather than a generic anomaly detector. Surprisingly, detection does not confer resistance; on both factual and safety benchmarks,…
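The two mechanisms named in the abstract, adding a vector to the residual stream and comparing steering directions by cosine similarity, can be illustrated with a minimal NumPy sketch. This is not the authors' code: the array shapes, the strength parameter `alpha`, and the helper names `steer` and `cosine_similarity` are assumptions made for illustration, and a real intervention would be applied inside a transformer layer rather than to a random array.

```python
import numpy as np


def steer(residual, vector, alpha=4.0):
    """Add a scaled steering vector to every token position.

    residual: array of shape (seq_len, d_model), a stand-in for one
              layer's residual stream (hypothetical toy data here).
    vector:   array of shape (d_model,), the steering direction.
    alpha:    hypothetical steering strength.
    """
    return residual + alpha * vector


def cosine_similarity(a, b):
    """Cosine of the angle between two direction vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


rng = np.random.default_rng(0)
d_model = 64

# Toy residual stream for a 5-token sequence.
residual = rng.normal(size=(5, d_model))

# A unit-norm direction standing in for a steering vector seen in training.
train_vec = rng.normal(size=d_model)
train_vec /= np.linalg.norm(train_vec)

# A nearby direction: the training vector plus a small perturbation.
near_vec = train_vec + 0.1 * rng.normal(size=d_model) / np.sqrt(d_model)

# An unrelated direction: a random vector with its component along
# train_vec projected out, so the two are orthogonal.
far_vec = rng.normal(size=d_model)
far_vec -= np.dot(far_vec, train_vec) * train_vec

steered = steer(residual, train_vec)
shift = steered - residual  # every row equals alpha * train_vec

print("cosine(near, train):", cosine_similarity(near_vec, train_vec))
print("cosine(far,  train):", cosine_similarity(far_vec, train_vec))
```

Under the abstract's account, a vector like `near_vec` (high cosine similarity to the training distribution) is the kind a trained detector would be expected to flag, while one like `far_vec` is the kind it would miss; the sketch only computes the geometry and does not model the detector itself.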
Taxonomy
Topics: Neurobiology of Language and Bilingualism · Topic Modeling · Multimodal Machine Learning Applications
