TL;DR
This paper investigates how large language models detect steering vectors injected into their residual stream. It finds that detection is behaviorally robust and traces it to a two-stage circuit that could potentially be amplified to improve introspective awareness.
Contribution
It identifies the circuit mechanisms that enable introspective detection in open-weights models, shows that the capability emerges from post-training (preference optimization rather than standard supervised finetuning), and points to the potential for amplifying it.
Findings
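Detection is behaviorally robust: models detect injected steering vectors at moderate rates with 0% false positives across diverse prompts and dialogue formats.
Preference optimization algorithms like DPO can elicit detection; standard supervised finetuning does not.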
Detection involves a two-stage circuit with evidence carrier and gate features.
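To make the first finding concrete, the two behavioral quantities it refers to (the detection rate on injected trials and the false-positive rate on control trials) can be computed as below. This is a minimal sketch, not the paper's evaluation code: the `Trial` record, the `detection_rates` helper, and the trial data are all hypothetical and made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Trial:
    injected: bool  # was a steering vector injected on this trial?
    reported: bool  # did the model report detecting an injection?


def detection_rates(trials):
    """Return (detection_rate, false_positive_rate) over a list of trials.

    detection_rate      = reported / total, over trials with an injection
    false_positive_rate = reported / total, over control trials (no injection)
    """
    injected = [t for t in trials if t.injected]
    control = [t for t in trials if not t.injected]
    nan = float("nan")
    detection = sum(t.reported for t in injected) / len(injected) if injected else nan
    false_pos = sum(t.reported for t in control) / len(control) if control else nan
    return detection, false_pos


# Hypothetical outcomes, NOT results from the paper.
trials = [
    Trial(injected=True, reported=True),
    Trial(injected=True, reported=False),
    Trial(injected=True, reported=False),
    Trial(injected=True, reported=True),
    Trial(injected=False, reported=False),
    Trial(injected=False, reported=False),
]
detection, false_pos = detection_rates(trials)
print(f"detection rate: {detection:.2f}, false-positive rate: {false_pos:.2f}")
```

A "moderate detection rate with 0% false positives" corresponds to the first number lying well between 0 and 1 while the second is exactly 0.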
Abstract
Recent work has shown that LLMs can sometimes detect when steering vectors are injected into their residual stream and identify the injected concept -- a phenomenon termed "introspective awareness." We investigate the mechanisms underlying this capability in open-weights models. First, we find that it is behaviorally robust: models detect injected steering vectors at moderate rates with 0% false positives across diverse prompts and dialogue formats. Notably, this capability emerges specifically from post-training; we show that preference optimization algorithms like DPO can elicit it, but standard supervised finetuning does not. We provide evidence that detection cannot be explained by simple linear association between certain steering vectors and directions promoting affirmative responses. We trace the detection mechanism to a two-stage circuit in which "evidence carrier" features in…
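The intervention described in the abstract, injecting a steering vector into the residual stream, can be sketched on a toy array. This is a hedged illustration under stated assumptions rather than the authors' setup: the residual stream is modeled as a plain `(seq_len, d_model)` NumPy array at one layer, and the function name, the `strength` parameter, and the unit-normalization of the vector are all choices made here for illustration.

```python
import numpy as np


def inject_steering_vector(residual, steering_vector, strength=4.0, positions=None):
    """Add a scaled steering direction to residual-stream activations.

    residual:        array of shape (seq_len, d_model), activations at one layer
    steering_vector: array of shape (d_model,), the concept direction
    strength:        scalar multiplier (hypothetical default)
    positions:       token positions to steer; None means all positions
    Returns a new array; the input is left unmodified.
    """
    direction = steering_vector / np.linalg.norm(steering_vector)
    steered = residual.copy()
    index = slice(None) if positions is None else positions
    steered[index] += strength * direction
    return steered


# Toy example with random activations (assumed shapes, not a real model).
rng = np.random.default_rng(0)
residual = rng.normal(size=(5, 8))
concept = rng.normal(size=8)
steered = inject_steering_vector(residual, concept, strength=4.0, positions=[2, 3])
```

In a real model the same addition would typically be applied through a forward hook on a chosen layer. The detection question is then whether the model, when asked, reports that something was injected and names the concept.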
