Brittle Minds, Fixable Activations: Understanding Belief Representations in Language Models
Matteo Bortoletto, Constantin Ruhdorfer, Lei Shi, Andreas Bulling

TL;DR
This paper investigates how language models internally represent beliefs and mental states, revealing that these representations are structured, improve with size and fine-tuning, but are fragile and can be corrected through targeted activation edits.
Contribution
It provides the first systematic analysis of belief representations in LMs, demonstrating their structure, dependence on training, and potential for correction.
Findings
Model size and fine-tuning enhance belief representations.
Belief representations are structured, not spurious.
Targeted activation edits can correct incorrect ToM inferences.
Abstract
Despite growing interest in Theory of Mind (ToM) tasks for evaluating language models (LMs), little is known about how LMs internally represent mental states of self and others. Understanding these internal mechanisms is critical - not only to move beyond surface-level performance, but also for model alignment and safety, where subtle misattributions of mental states may go undetected in generated outputs. In this work, we present the first systematic investigation of belief representations in LMs by probing models across different scales, training regimens, and prompts - using control tasks to rule out confounds. Our experiments provide evidence that both model size and fine-tuning substantially improve LMs' internal representations of others' beliefs, which are structured - not mere by-products of spurious correlations - yet brittle to prompt variations. Crucially, we show that these…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsTopic Modeling
