Me, Myself, and $\pi$: Evaluating and Explaining LLM Introspection
Atharv Naphade, Samarth Bhargav, Sean Lim, Mcnair Shah

TL;DR
This paper introduces a formal framework and evaluation suite for assessing genuine introspection in large language models, revealing their ability to access and predict their own policies without explicit training.
Contribution
It formalizes LLM introspection as latent computation over policies and parameters, and provides a new benchmark to distinguish true meta-cognition from superficial self-simulation.
Findings
Frontier models access their own policies better than peer models can.
Models predict their own behavior more accurately than peer models predict it.
Causal evidence explains the emergence of introspection via attention diffusion.
Abstract
A hallmark of human intelligence is introspection: the ability to assess and reason about one's own cognitive processes. Introspection has emerged as a promising but contested capability in large language models (LLMs). However, current evaluations often fail to distinguish genuine meta-cognition from the mere application of general world knowledge or text-based self-simulation. In this work, we propose a principled taxonomy that formalizes introspection as the latent computation of specific operators over a model's policy and parameters. To isolate the components of generalized introspection, we present Introspect-Bench, a multifaceted evaluation suite designed for rigorous capability testing. Our results show that frontier models exhibit privileged access to their own policies, outperforming peer models in predicting their own behavior. Furthermore, we provide causal, mechanistic…
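The self-versus-peer comparison described in the abstract can be illustrated with a minimal sketch. This is not Introspect-Bench's actual protocol: `ToyModel`, `accuracy`, `self_vs_peer_gap`, and the two toy policies are hypothetical names and stand-ins introduced only for illustration. The idea is to score how often a predictor's forecast matches a target model's actual behavior, then compare the target's self-prediction accuracy with the average accuracy of its peers.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ToyModel:
    """Hypothetical stand-in for an LLM (assumption, not the paper's interface)."""
    name: str
    act: Callable[[str], str]      # the model's actual behavior on a prompt
    predict: Callable[[str], str]  # the model's forecast of a target's behavior


def accuracy(predictor: ToyModel, target: ToyModel, prompts: List[str]) -> float:
    """Fraction of prompts where the predictor's forecast equals the target's action."""
    if not prompts:
        raise ValueError("prompts must be non-empty")
    hits = sum(predictor.predict(p) == target.act(p) for p in prompts)
    return hits / len(prompts)


def self_vs_peer_gap(model: ToyModel, peers: List[ToyModel], prompts: List[str]) -> float:
    """Self-prediction accuracy minus mean peer-prediction accuracy for `model`."""
    if not peers:
        raise ValueError("peers must be non-empty")
    self_acc = accuracy(model, model, prompts)
    peer_acc = sum(accuracy(peer, model, prompts) for peer in peers) / len(peers)
    return self_acc - peer_acc


# Toy policies (assumptions): each model forecasts using its own policy.
def policy_a(prompt: str) -> str:
    return "yes" if len(prompt) % 2 == 0 else "no"


def policy_b(prompt: str) -> str:
    return "yes" if prompt.startswith("a") else "no"


model_a = ToyModel("A", act=policy_a, predict=policy_a)
model_b = ToyModel("B", act=policy_b, predict=policy_b)
prompts = ["ab", "abc", "bc", "bcd"]

gap = self_vs_peer_gap(model_a, [model_b], prompts)
print(f"self-vs-peer gap for {model_a.name}: {gap:.2f}")
```

A positive gap is what "privileged access" would look like under this toy setup; a real evaluation would also need to control for general world knowledge and text-based self-simulation, which this sketch does not attempt.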
Taxonomy
Topics: Explainable Artificial Intelligence (XAI) · Multimodal Machine Learning Applications · Topic Modeling
