Latent Introspection: Models Can Detect Prior Concept Injections
Theia Pearson-Vogel, Martin Vanek, Raymond Douglas, Jan Kulveit

TL;DR
This paper shows that a large language model can detect and identify concepts injected into its earlier context. The detection signal shows up in latent (residual-stream) analysis rather than in sampled outputs, revealing a capacity for introspection that bears on model understanding and safety.
Contribution
The study uncovers a latent introspective ability in a Qwen 32B model to detect concept injections. Prompting the model with accurate information about AI introspection mechanisms strengthens this ability, with implications for model transparency and safety.
Findings
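Logit lens analysis of the residual stream shows that the model detects concept injections, even though its sampled outputs deny them.
Prompting with accurate information about AI introspection raises detection sensitivity from 0.3% to 39.9%, with only a 0.6% increase in false positives.
Mutual information between nine injected and recovered concepts rises from 0.61 bits to 1.05 bits.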
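To make the mutual-information finding concrete, here is a minimal Python sketch of how mutual information between injected and recovered concepts could be computed from a table of counts. The plug-in estimator and the function name `mutual_information_bits` are assumptions for illustration; the summary does not say which estimator the authors use.

```python
import numpy as np

def mutual_information_bits(counts):
    """Plug-in estimate of I(injected; recovered) in bits.

    counts[i, j] = number of trials where concept i was injected
    and concept j was recovered (hypothetical layout).
    """
    counts = np.asarray(counts, dtype=float)
    joint = counts / counts.sum()              # joint distribution p(i, j)
    p_inj = joint.sum(axis=1, keepdims=True)   # marginal over injected concepts
    p_rec = joint.sum(axis=0, keepdims=True)   # marginal over recovered concepts
    nonzero = joint > 0                        # skip empty cells (0 * log 0 = 0)
    ratio = joint[nonzero] / (p_inj @ p_rec)[nonzero]
    return float(np.sum(joint[nonzero] * np.log2(ratio)))

n_concepts = 9
perfect = np.eye(n_concepts)                   # every concept recovered exactly
chance = np.ones((n_concepts, n_concepts))     # recovery independent of injection

print(mutual_information_bits(perfect))        # log2(9), about 3.17 bits
print(mutual_information_bits(chance))         # 0.0 bits
```

If the nine concepts were injected equally often, the ceiling would be log2(9), roughly 3.17 bits, so the reported 0.61 and 1.05 bits would correspond to partial but clearly non-random recovery.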
Abstract
We uncover a latent capacity for introspection in a Qwen 32B model, demonstrating that the model can detect when concepts have been injected into its earlier context and identify which concept was injected. While the model denies injection in sampled outputs, logit lens analysis reveals clear detection signals in the residual stream, which are attenuated in the final layers. Furthermore, prompting the model with accurate information about AI introspection mechanisms can dramatically strengthen this effect: the sensitivity to injection increases massively (0.3% -> 39.9%) with only a 0.6% increase in false positives. Also, mutual information between nine injected and recovered concepts rises from 0.61 bits to 1.05 bits, ruling out generic noise explanations. Our results demonstrate models can have a surprising capacity for introspection and steering awareness that is easy to overlook,…
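The logit lens analysis mentioned in the abstract can be sketched in a few lines of Python. This is a toy illustration on random arrays, not the authors' pipeline: the shapes, the RMS normalisation without a learned gain, and the token index `YES_ID` are all assumptions made for the example.

```python
import numpy as np

def logit_lens(hidden_states, unembed, eps=1e-6):
    """Project each layer's residual-stream vector onto the vocabulary.

    hidden_states: (n_layers, d_model) residual stream at one token position.
    unembed:       (d_model, vocab) unembedding matrix.
    """
    # RMS normalisation stands in for the model's final norm (assumption).
    rms = np.sqrt(np.mean(hidden_states ** 2, axis=-1, keepdims=True) + eps)
    return (hidden_states / rms) @ unembed

def token_probability(logits, token_id):
    """Softmax probability of one token at every layer."""
    shifted = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    probs = np.exp(shifted) / np.exp(shifted).sum(axis=-1, keepdims=True)
    return probs[:, token_id]

# Synthetic stand-ins for real model activations and weights.
rng = np.random.default_rng(0)
n_layers, d_model, vocab = 6, 16, 50
hidden = rng.normal(size=(n_layers, d_model))
unembed = rng.normal(size=(d_model, vocab))

YES_ID = 7  # hypothetical index of a token signalling "injection detected"
layer_probs = token_probability(logit_lens(hidden, unembed), YES_ID)
for layer, p in enumerate(layer_probs):
    print(f"layer {layer}: p(yes) = {p:.4f}")
```

With real activations, a detection signal that is attenuated in the final layers would appear as a per-layer probability that rises in the middle of the network and falls again near the output.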
Taxonomy
Topics: Explainable Artificial Intelligence (XAI) · Adversarial Robustness in Machine Learning · Topic Modeling
