Emergent Introspective Awareness in Large Language Models
Jack Lindsey

TL;DR
This paper explores whether large language models can introspect on their internal states, demonstrating some ability to recognize injected concepts, recall internal representations, and modulate their activations, though these abilities are currently unreliable.
Contribution
The paper introduces a method for assessing model introspection: inject representations of known concepts into a model's activations, then measure how the injection changes the model's self-reports. The results suggest emergent, though unreliable, introspective abilities.
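As a rough illustration of the injection step described above, the sketch below builds a concept vector and adds it to a block of hidden states. It is a toy NumPy example, not the paper's implementation: the difference-of-means estimate, the function names (`concept_vector`, `inject`, `projection`), the `strength` parameter, and the random stand-in activations are all assumptions made for illustration. In an actual experiment the modified activations would feed back into the model, and the quantity of interest would be the model's textual self-report rather than a projection.

```python
import numpy as np


def concept_vector(acts_with, acts_without):
    """Estimate a concept direction as a difference of mean activations.

    Assumption: this is one common way to obtain such a vector; the
    summary does not specify how the paper computes it.
    """
    return acts_with.mean(axis=0) - acts_without.mean(axis=0)


def inject(hidden, vector, strength):
    """Add the scaled concept vector to the hidden state at every position."""
    return hidden + strength * vector


def projection(hidden, vector):
    """Project each position's hidden state onto the unit concept direction."""
    unit = vector / np.linalg.norm(vector)
    return hidden @ unit


# Stand-in activations: rows are samples, columns are hidden dimensions.
rng = np.random.default_rng(0)
d_model = 16
direction = np.zeros(d_model)
direction[3] = 1.0  # hypothetical "concept" axis for the toy data

acts_without = rng.normal(size=(32, d_model))
acts_with = rng.normal(size=(32, d_model)) + 2.0 * direction
vec = concept_vector(acts_with, acts_without)

# Hidden states for a 5-token context, before and after injection.
hidden = rng.normal(size=(5, d_model))
steered = inject(hidden, vec, strength=4.0)

# Injection shifts every position along the concept direction by
# strength * ||vec||, which is what a downstream readout could pick up.
shift = projection(steered, vec) - projection(hidden, vec)
```

Because the concept is known in advance, any self-report that names it can be checked against ground truth, which is what separates this setup from asking the model about its states in conversation alone.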
Findings
Models can detect injected concepts in certain scenarios.
Models can recall and distinguish their internal representations from raw inputs.
Models can modulate their internal activations when instructed.
Abstract
We investigate whether large language models can introspect on their internal states. It is difficult to answer this question through conversation alone, as genuine introspection cannot be distinguished from confabulations. Here, we address this challenge by injecting representations of known concepts into a model's activations, and measuring the influence of these manipulations on the model's self-reported states. We find that models can, in certain scenarios, notice the presence of injected concepts and accurately identify them. Models demonstrate some ability to recall prior internal representations and distinguish them from raw text inputs. Strikingly, we find that some models can use their ability to recall prior intentions in order to distinguish their own outputs from artificial prefills. In all these experiments, Claude Opus 4 and 4.1, the most capable models we tested,…
Taxonomy
Topics: Explainable Artificial Intelligence (XAI) · Topic Modeling · Multimodal Machine Learning Applications
