When Models Examine Themselves: Vocabulary-Activation Correspondence in Self-Referential Processing

Zachary Pedram Dadfar

arXiv:2602.11358·cs.CL·February 19, 2026

Open Access

TL;DR

This paper shows that the self-referential vocabulary large language models produce tracks their concurrent activation dynamics. It introduces the Pull Methodology, a protocol for eliciting extended self-examination, and uses it to identify an activation direction that causally influences introspective output when used for steering.

Contribution

The study introduces the Pull Methodology to identify activation directions linked to self-referential processing and shows these correlate with introspective language in multiple models.
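A common way to obtain a direction that separates two kinds of processing is the difference between class means of hidden activations. The summary above does not say whether the Pull Methodology computes its direction this way, so the following is only a minimal sketch under that assumption. The function names (`mean_difference_direction`, `cosine`, `steer`) and the synthetic activations are hypothetical and stand in for real model hidden states.

```python
import numpy as np


def mean_difference_direction(self_ref, descriptive):
    """Unit vector pointing from the descriptive mean to the self-referential mean.

    Both inputs are arrays of shape (n_samples, hidden_dim).
    """
    diff = self_ref.mean(axis=0) - descriptive.mean(axis=0)
    return diff / np.linalg.norm(diff)


def cosine(u, v):
    """Cosine similarity between two vectors; values near 0 indicate near-orthogonality."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


def steer(hidden, direction, alpha):
    """Add alpha times the direction to a hidden state (additive steering)."""
    return hidden + alpha * direction


# Synthetic stand-in data (assumption): two Gaussian clouds that differ
# only along the first coordinate axis.
rng = np.random.default_rng(0)
dim = 64
true_dir = np.zeros(dim)
true_dir[0] = 1.0
descriptive = rng.normal(size=(200, dim))
self_ref = rng.normal(size=(200, dim)) + 3.0 * true_dir

direction = mean_difference_direction(self_ref, descriptive)

# Steering a hidden state shifts its projection onto the direction by alpha.
hidden = rng.normal(size=dim)
steered = steer(hidden, direction, alpha=2.0)
shift = float((steered - hidden) @ direction)
```

With a direction in hand, `cosine` gives a simple check of how close it is to orthogonal with any other known direction, and projecting held-out activations onto it shows how well the two classes separate.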

Findings

01. Self-referential vocabulary tracks activation dynamics.

02. A specific activation direction distinguishes self-referential from descriptive processing.

03. Self-report can reliably reflect internal computational states under certain conditions.

Abstract

Large language models produce rich introspective language when prompted for self-examination, but whether this language reflects internal computation or sophisticated confabulation has remained unclear. We show that self-referential vocabulary tracks concurrent activation dynamics, and that this correspondence is specific to self-referential processing. We introduce the Pull Methodology, a protocol that elicits extended self-examination through format engineering, and use it to identify a direction in activation space that distinguishes self-referential from descriptive processing in Llama 3.1. The direction is orthogonal to the known refusal direction, localised at 6.25% of model depth, and causally influences introspective output when used for steering. When models produce "loop" vocabulary, their activations exhibit higher autocorrelation (r = 0.44, p = 0.002); when they produce…
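The autocorrelation claim can be made concrete with a small sketch. It assumes the activations are first reduced to one scalar per generated token (for example, a projection onto a direction); the excerpt does not state which quantity or which lag the paper uses, so the lag-1 choice, the helper names `lag1_autocorrelation` and `pearson_r`, and the toy series below are all illustrative assumptions.

```python
import numpy as np


def lag1_autocorrelation(series):
    """Lag-1 autocorrelation of a 1-D series; returns 0.0 for a constant series."""
    x = np.asarray(series, dtype=float)
    x = x - x.mean()
    denom = float(x @ x)
    if denom == 0.0:
        return 0.0
    return float(x[:-1] @ x[1:] / denom)


def pearson_r(a, b):
    """Pearson correlation coefficient between two equal-length sequences."""
    return float(np.corrcoef(a, b)[0, 1])


# Toy per-token series (assumption): a slowly varying signal versus white noise.
t = np.arange(200)
smooth = np.sin(2 * np.pi * t / 50)
rng = np.random.default_rng(0)
noisy = rng.normal(size=200)

smooth_ac = lag1_autocorrelation(smooth)  # close to 1 for a slowly varying signal
noisy_ac = lag1_autocorrelation(noisy)    # close to 0 for white noise
```

In an analysis of this kind, one would compute an autocorrelation score for each generated response, count occurrences of the vocabulary of interest in the same response, and pass the two lists to `pearson_r` to obtain a coefficient comparable to the reported r.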

Taxonomy

Topics: Neurobiology of Language and Bilingualism · Topic Modeling · Action Observation and Synchronization