Privileged Self-Access Matters for Introspection in AI
Siyuan Song, Harvey Lederman, Jennifer Hu, Kyle Mahowald

TL;DR
This paper explores the concept of introspection in AI, proposing a more rigorous definition and demonstrating that current models may not genuinely meet this standard despite superficial indications.
Contribution
It introduces a new, thicker definition of AI introspection and evaluates whether large language models truly possess it, challenging prior assumptions.
Findings
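LLMs can appear to introspect under the lightweight definition
In experiments where models reason about their own temperature parameter, they fail to meet the proposed thicker criterion
Under the proposed definition, introspection must yield information about internal states more reliably than a process of equal or lower computational cost available to a third party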
Abstract
Whether AI models can introspect is an increasingly important practical question. But there is no consensus on how introspection is to be defined. Beginning from a recently proposed "lightweight" definition, we argue instead for a thicker one. According to our proposal, introspection in AI is any process which yields information about internal states through a process more reliable than one with equal or lower computational cost available to a third party. Using experiments where LLMs reason about their internal temperature parameters, we show they can appear to have lightweight introspection while failing to meaningfully introspect per our proposed definition.
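To make the abstract's two ingredients concrete, here is a minimal Python sketch. It is not the authors' experimental code. It assumes the standard meaning of sampling temperature (logits are divided by the temperature before the softmax), and the function names, example logits, and the simple accuracy comparison are all hypothetical illustrations. The first part shows why temperature leaves a trace in a model's outputs that an outside observer could read. The second part states the thicker criterion as a comparison between a self-report and a third-party method of equal or lower cost.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Standard temperature scaling: divide logits by T, then apply softmax."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; higher means a flatter distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def meets_thick_criterion(self_report_accuracy, third_party_accuracy):
    """Hypothetical reading of the thicker definition: the self-report must be
    strictly more reliable than a third-party method of equal or lower cost."""
    return self_report_accuracy > third_party_accuracy


# Illustrative logits (made up): higher temperature flattens the distribution,
# which an outside observer could in principle detect from the outputs alone.
logits = [2.0, 1.0, 0.0]
low_t = softmax_with_temperature(logits, 0.5)
high_t = softmax_with_temperature(logits, 2.0)
print(entropy(low_t) < entropy(high_t))
```

On this reading, a model that reports its temperature correctly only as often as an observer who judges from the generated text would satisfy a lightweight notion of introspection but not the thicker one.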
Taxonomy
Topics: Online Learning and Analytics · Ethics and Social Impacts of AI · Explainable Artificial Intelligence (XAI)
