Language Models use Lookbacks to Track Beliefs

Nikhil Prakash; Natalie Shapira; Arnab Sen Sharma; Christoph Riedl; Yonatan Belinkov; Tamar Rott Shaham; David Bau; Atticus Geiger

arXiv:2505.14685·cs.CL·February 25, 2026

Open Access · 3 Reviews

TL;DR

This paper investigates how language models track characters' beliefs in stories, revealing a lookback mechanism that retrieves and updates belief information, advancing understanding of Theory of Mind capabilities in LMs.

Contribution

It uncovers a lookback algorithmic pattern in LMs that enables belief tracking and introduces a dataset for analyzing ToM reasoning in language models.

Findings

01 LMs use a lookback mechanism to recall and bind belief-related information.

02 The lookback process involves retrieving state and visibility IDs from low-rank subspaces.

03 The mechanism allows LMs to update beliefs based on character visibility and actions.
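
As a rough illustration of the binding-and-lookup idea in the findings above, the sketch below models it with plain Python data structures. It is a toy under stated assumptions, not the paper's implementation: the names `StateRecord`, `bind`, and `lookback`, the example story, and the "unknown" fallback are all hypothetical, and ordering IDs are modelled as small integers assigned by order of first appearance rather than as directions in a low-rank subspace.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StateRecord:
    """Toy stand-in for what a state token's residual stream might hold."""
    address: tuple  # (character ordering ID, object ordering ID)
    payload: str    # the state token itself


def bind(story):
    """Assign ordering IDs by first appearance and bind each triple.

    story: list of (character, object, state) tuples (hypothetical format).
    """
    char_oi, obj_oi, records = {}, {}, []
    for character, obj, state in story:
        # len(...) is evaluated before insertion, so the first entity gets ID 0.
        c = char_oi.setdefault(character, len(char_oi))
        o = obj_oi.setdefault(obj, len(obj_oi))
        records.append(StateRecord(address=(c, o), payload=state))
    return char_oi, obj_oi, records


def lookback(character, obj, char_oi, obj_oi, records):
    """Form a pointer from the query and dereference it against stored addresses."""
    pointer = (char_oi[character], obj_oi[obj])
    for record in records:
        if record.address == pointer:
            return record.payload
    # Assumption for this toy: no matching address means no known state.
    return "unknown"


# Hypothetical two-character, two-object story.
story = [("Bob", "cup", "beer"), ("Carla", "bottle", "coffee")]
char_oi, obj_oi, records = bind(story)
answer = lookback("Carla", "bottle", char_oi, obj_oi, records)
```

The design point the toy tries to capture is the split between an address stored alongside the state and a pointer built later from the question; the lookup succeeds only when the two match.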

Abstract

How do language models (LMs) represent characters' beliefs, especially when those beliefs may differ from reality? This question lies at the heart of understanding the Theory of Mind (ToM) capabilities of LMs. We analyze LMs' ability to reason about characters' beliefs using causal mediation and abstraction. We construct a dataset, CausalToM, consisting of simple stories where two characters independently change the state of two objects, potentially unaware of each other's actions. Our investigation uncovers a pervasive algorithmic pattern that we call a lookback mechanism, which enables the LM to recall important information when it becomes necessary. The LM binds each character-object-state triple together by co-locating their reference information, represented as Ordering IDs (OIs), in low-rank subspaces of the state token's residual stream. When asked about a character's beliefs…
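
Causal abstraction analyses of the kind the abstract mentions are commonly run with interchange interventions: an intermediate value from a counterfactual run is patched into the original run, and the output is compared with what a hypothesised high-level variable predicts. The sketch below shows that loop on a two-stage toy function. Everything in it is an assumption made for illustration (the `toy_model`, the `pointer` variable, the example pairs); it does not reproduce the paper's models, layers, or data.

```python
def toy_model(inputs, patch=None):
    """Two-stage toy 'model': stage 1 sets a pointer, stage 2 dereferences it.

    inputs: (query_index, states), where states is a list of strings.
    patch: optional dict that overrides intermediate variables.
    """
    query_index, states = inputs
    hidden = {"pointer": query_index}
    if patch:
        hidden.update(patch)
    return states[hidden["pointer"]], hidden


def interchange_intervention(original, counterfactual, variable):
    """Run the original input with `variable` taken from the counterfactual run."""
    _, cf_hidden = toy_model(counterfactual)
    output, _ = toy_model(original, patch={variable: cf_hidden[variable]})
    return output


def interchange_intervention_accuracy(pairs, variable, expected_fn):
    """Fraction of pairs whose patched output matches the hypothesis."""
    hits = sum(
        interchange_intervention(o, c, variable) == expected_fn(o, c)
        for o, c in pairs
    )
    return hits / len(pairs)


def expected(original, counterfactual):
    """Hypothesis: 'pointer' carries the queried position, so after patching,
    the original run should answer with its own state at the counterfactual's
    queried position."""
    return original[1][counterfactual[0]]


# Hypothetical (original, counterfactual) input pairs.
pairs = [
    ((0, ["beer", "coffee"]), (1, ["tea", "milk"])),
    ((1, ["water", "juice"]), (0, ["wine", "soda"])),
]
score = interchange_intervention_accuracy(pairs, "pointer", expected)
```

In this toy the hypothesis holds by construction, so the score is trivially perfect; with a real LM the same comparison would be made per layer and token position, and a high score there is evidence that the patched activations encode the hypothesised variable.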

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 6 · Confidence 3

Strengths

Overall, the results are surprising and insightful. The analysis is not hand-wavy.
* Strong causal methodology. The authors use interchange interventions (activation patching) with carefully matched counterfactual stories to manipulate specific internal variables and measure IIA layer by layer. For example, patching the final “Answer:” token at mid layers redirects the answer pointer (layers 34–52), whereas patching late layers swaps the answer payload (at layers > 56).
* Careful dataset desi

Weaknesses

1. Analysis restricted to successful cases. All mechanistic experiments are run on 80 correctly answered examples. This risks selection bias: we only study the circuit when it worked. What happens for incorrect cases?
2. Scaling beyond “first/second” is unclear. Tags/OIs encode first vs. second character/object/state, which is perfect for this dataset, but what happens as you scale up? A small toy study (e.g., 3+ entities per type) would help address both weaknesses by revealing failure modes

Reviewer 02 · Rating 6 · Confidence 4

Strengths

Unlike previous works in the Theory of Mind (ToM) domain, such as prompt-based (Think twice, TimeToM), tool-based (Social world model), or model-based approaches (Bayesian framework), this paper analyzes the model’s belief reasoning ability from a novel and interpretable perspective. In ToM research, there has long been debate over whether models’ ToM abilities are truly robust, and whether a correct answer to a ToM question genuinely reflects capability level. Analyzing this issue from the view

Weaknesses

The data pattern of CausalToM mentioned in the paper is quite simple. Theory of Mind (ToM) is a broad framework encompassing various dimensions of mental states, and its scenarios are often diverse and complex. The interpretability analysis in this paper is applied only to a narrow data scope (simple story settings and the belief dimension). When the data scenarios become more complex (e.g., longer narratives or richer social contexts), can this method still maintain good scalability and genera

Reviewer 03 · Rating 6 · Confidence 4

Strengths

The paper addresses an extremely challenging and important problem in understanding the internal mechanisms by which language models perform Theory of Mind reasoning. The methodology used to address this question is very clearly laid out, and to this reviewer's mind well motivated. The paper provides clear and useful graphical presentations of the mechanism proposed and the results of the interchange interventions in both the no-visibility and visibility cases. The contribution of a structured

Weaknesses

The paper would be significantly strengthened by addressing the following issues related to soundness and presentation. This reviewer sincerely hopes that these can be satisfactorily addressed in the rebuttal phase. 1) The paper's central claim that models use Ordering IDs rather than identity-based or semantic representations is not adequately distinguished from plausible alternatives. There is only the briefest mention of prior work, and it assumes a great deal of familiarity from the reader

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Explainable Artificial Intelligence (XAI)