Mechanisms of Introspective Awareness

Uzay Macar, Li Yang, Atticus Wang, Peter Wallich, Emmanuel Ameisen, and Jack Lindsey

arXiv:2603.21396·cs.LG·May 18, 2026

TL;DR

This paper investigates the mechanisms behind large language models' ability to detect injected steering vectors, revealing a robust, multi-layer circuit that can be enhanced to improve introspective awareness.

Contribution

The paper identifies circuit mechanisms that enable introspective detection in open-weights models, and highlights both the role of post-training algorithms and the potential for amplifying the capability.

Findings

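01 Detection is behaviorally robust: models detect injected steering vectors at moderate rates with 0% false positives across diverse prompts and dialogue formats.

02 Preference optimization algorithms such as DPO can elicit detection; standard supervised finetuning does not.

03 Detection is traced to a two-stage circuit involving "evidence carrier" features and gate features.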
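
To make the two behavioral quantities in the first finding concrete, here is a minimal sketch of how a detection rate and a false-positive rate could be computed from labeled trials. The `Trial` record, the `detection_metrics` helper, and the toy data are all hypothetical illustrations, not taken from the paper or its repository.

```python
from dataclasses import dataclass


@dataclass
class Trial:
    """One evaluation trial (hypothetical record layout)."""
    injected: bool  # was a steering vector injected on this trial?
    reported: bool  # did the model report detecting an injection?


def detection_metrics(trials):
    """Return (detection_rate, false_positive_rate) for a list of trials."""
    injected = [t for t in trials if t.injected]
    control = [t for t in trials if not t.injected]
    if not injected or not control:
        raise ValueError("need at least one injected and one control trial")
    # Detection rate: fraction of injected trials the model flagged.
    detection_rate = sum(t.reported for t in injected) / len(injected)
    # False-positive rate: fraction of control trials the model flagged.
    false_positive_rate = sum(t.reported for t in control) / len(control)
    return detection_rate, false_positive_rate


# Toy data, invented for illustration only.
trials = [
    Trial(injected=True, reported=True),
    Trial(injected=True, reported=False),
    Trial(injected=True, reported=True),
    Trial(injected=True, reported=False),
    Trial(injected=False, reported=False),
    Trial(injected=False, reported=False),
]
rate, fpr = detection_metrics(trials)
print(f"detection rate = {rate:.2f}, false-positive rate = {fpr:.2f}")
```

On this toy data the model flags half of the injected trials and none of the control trials, which is the qualitative pattern the finding describes (moderate detection, no false positives).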

Abstract

Recent work has shown that LLMs can sometimes detect when steering vectors are injected into their residual stream and identify the injected concept -- a phenomenon termed "introspective awareness." We investigate the mechanisms underlying this capability in open-weights models. First, we find that it is behaviorally robust: models detect injected steering vectors at moderate rates with 0% false positives across diverse prompts and dialogue formats. Notably, this capability emerges specifically from post-training; we show that preference optimization algorithms like DPO can elicit it, but standard supervised finetuning does not. We provide evidence that detection cannot be explained by simple linear association between certain steering vectors and directions promoting affirmative responses. We trace the detection mechanism to a two-stage circuit in which "evidence carrier" features in…
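
As a rough illustration of the setup the abstract describes, the sketch below adds a scaled steering vector to a residual-stream activation and measures how closely that vector aligns with a direction promoting an affirmative response. Everything here is an assumption made for illustration: the function names `inject` and `cosine`, the 4-dimensional toy vectors, and the additive form with a scalar `strength` are a common way to describe steering, not the paper's confirmed implementation.

```python
import math


def inject(residual, steering_vector, strength):
    """Add a scaled steering vector to one residual-stream activation."""
    return [h + strength * v for h, v in zip(residual, steering_vector)]


def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy 4-dimensional vectors, invented for illustration only.
residual = [0.5, -1.0, 2.0, 0.0]       # activation at some layer and position
concept = [0.0, 1.0, 0.0, 0.0]         # hypothetical concept steering vector
yes_direction = [1.0, 0.0, 0.0, 0.0]   # hypothetical "affirmative" direction

steered = inject(residual, concept, strength=4.0)
alignment = cosine(concept, yes_direction)
print(steered, alignment)
```

In this toy case the concept vector is orthogonal to the affirmative direction, so the injection does not directly push the activation toward "yes". If a vector like this were still detected, a simple linear association with affirmative responses could not account for it, which is the kind of explanation the abstract says the authors argue against.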

Code & Models

Repositories

safety-research/introspection-mechanisms (GitHub)
