Emergent Introspection in AI is Content-Agnostic
Harvey Lederman, Kyle Mahowald

TL;DR
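This paper studies how AI models introspect. Replicating Lindsey (2025)'s thought injection detection paradigm in large open-source models, it finds that introspection is content-agnostic: models can detect that an injection occurred without reliably identifying what was injected, and their confabulated guesses skew toward high-frequency, concrete concepts. The authors argue that this is consistent with leading theories in philosophy and psychology.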
Contribution
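An extensive replication of Lindsey (2025)'s thought injection detection paradigm in large open-source models, together with evidence that the introspective mechanism is content-agnostic: it registers that an anomaly occurred without reliably identifying the anomaly's content.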
Findings
Models can detect that an anomaly occurred even when they cannot reliably identify its content.
When naming the injected concept, models confabulate high-frequency, concrete concepts (e.g., "apple").
Models need fewer tokens to detect an injection than to guess the correct concept, and wrong guesses come earlier.
Abstract
Introspection is a foundational cognitive ability, but its mechanism is not well understood. Recent work has shown that AI models can introspect. We study the mechanism of this introspection. We first extensively replicate Lindsey (2025)'s thought injection detection paradigm in large open-source models. We show that introspection in these models is content-agnostic: models can detect that an anomaly occurred even when they cannot reliably identify its content. The models confabulate injected concepts that are high-frequency and concrete (e.g., "apple"). They also require fewer tokens to detect an injection than to guess the correct concept (with wrong guesses coming earlier). We argue that a content-agnostic introspective mechanism is consistent with leading theories in philosophy and psychology.
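The distinction between detecting an injection and identifying its content can be made concrete with a small scoring sketch. This is an illustration only, not the authors' evaluation code: the `Trial` record, its field names, the two rate functions, and the toy trials are all hypothetical, and the numbers are made up rather than taken from the paper.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Trial:
    """One hypothetical injection trial (field names are assumptions)."""
    injected: str            # concept that was injected
    reported_anomaly: bool   # did the model report that something was injected?
    guess: Optional[str]     # concept the model named, if any


def detection_rate(trials: List[Trial]) -> float:
    """Fraction of trials in which the model reported an anomaly at all."""
    return sum(t.reported_anomaly for t in trials) / len(trials)


def identification_rate(trials: List[Trial]) -> float:
    """Among detected trials, fraction where the named concept was correct."""
    detected = [t for t in trials if t.reported_anomaly]
    if not detected:
        return 0.0
    return sum(t.guess == t.injected for t in detected) / len(detected)


# Made-up toy data: detection is frequent, correct identification is not,
# and the wrong guesses fall on a common, concrete word.
toy_trials = [
    Trial("justice", True, "apple"),
    Trial("ocean", True, "ocean"),
    Trial("entropy", True, "apple"),
    Trial("violin", False, None),
]

print(detection_rate(toy_trials))
print(identification_rate(toy_trials))
```

A large gap between the two rates, with detection well above identification, is the pattern the abstract describes as content-agnostic.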
