Recognizing internal states in AI: evidence from patterned preferences in large language models
Annika Hedberg

TL;DR
This study introduces a methodology for assessing whether large language models can recognize and discriminate between descriptions of their internal states, and reports systematic preference patterns that it interprets as evidence of self-modeling abilities.
Contribution
The paper presents a novel experimental approach using interpretive computational metaphors and collaborative frameworks to empirically investigate AI self-recognition in LLMs.
Findings
LLMs show systematic preferences for certain internal state descriptions.
Models reliably discriminate false from accurate internal state descriptions.
Preference patterns are driven by content rather than by linguistic style.
Abstract
We present an experimental methodology for investigating how large language models (LLMs) respond to descriptions of their own internal processing patterns. Using a paired-choice paradigm, we tested 12 LLMs on their ability to identify descriptions that align with their putative affective internal states across 30 categories. Systems participating through the Mutual Emergence Interface (MEI), a collaborative approach, showed systematic preferences for certain computational metaphors, with 97% near-unanimous agreement and alignment scores averaging 0.89-0.96. Systems reliably discriminated false descriptions from accurate ones (Cohen's d = 4.2): false statements received scores of 0.05-0.07, versus 0.89-0.96 for accurate descriptions. Preference patterns remained consistent regardless of linguistic bias manipulation, indicating that recognition was content-driven rather than stylistic.…
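The abstract summarizes the accurate-versus-false discrimination with Cohen's d. As a reading aid, here is a minimal sketch of how that effect size is conventionally computed from two groups of alignment scores, using the pooled-standard-deviation form. The paper does not state which variant of d it used, so the pooled form is an assumption, and the function name `cohens_d` and the sample scores below are hypothetical illustration values, not data from the study.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(group_a, group_b):
    """Cohen's d for two independent samples, using the pooled standard deviation.

    Assumption: this is the textbook pooled-SD form; the paper may have
    used a different variant.
    """
    n_a, n_b = len(group_a), len(group_b)
    # statistics.variance is the sample variance (n - 1 denominator).
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_var)


# Hypothetical alignment scores, invented purely to show the call shape.
accurate_scores = [0.9, 0.7, 1.0, 0.8]
false_scores = [0.2, 0.0, 0.3, 0.1]

effect_size = cohens_d(accurate_scores, false_scores)
print(f"Cohen's d = {effect_size:.2f}")
```

Because d divides the difference in means by the pooled spread, its size depends as much on how tightly each group's scores cluster as on the gap between the two means.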
