Position: Theory of Mind Benchmarks are Broken for Large Language Models

Matthew Riemer, Zahra Ashktorab, Djallel Bouneffouf, Payel Das, Miao Liu, Justin D. Weisz, and Murray Campbell

arXiv:2412.19726 · cs.AI · June 13, 2025

TL;DR

This paper critiques current theory of mind benchmarks for large language models, highlighting their inability to test models' adaptation to new partners and proposing a new concept called functional theory of mind.

Contribution

The paper introduces the concept of functional theory of mind, emphasizing the importance of evaluating LLMs' ability to adapt rationally in context, beyond predicting behavior.

Findings

01. Many open source LLMs show strong literal theory of mind.

02. LLMs struggle with functional theory of mind, especially over long interactions.

03. Literal theory of mind performance does not imply functional theory of mind capabilities.

Abstract

Our paper argues that the majority of theory of mind benchmarks are broken because of their inability to directly test how large language models (LLMs) adapt to new partners. This problem stems from the fact that theory of mind benchmarks for LLMs are overwhelmingly inspired by the methods used to test theory of mind in humans and fall victim to a fallacy of attributing human-like qualities to AI agents. We expect that humans will engage in a consistent reasoning process across various questions about a situation, but this is known to not be the case for current LLMs. Most theory of mind benchmarks only measure what we call literal theory of mind: the ability to predict the behavior of others. However, this type of metric is only informative when agents exhibit self-consistent reasoning. Thus, we introduce the concept of functional theory of mind: the ability to adapt to agents…
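The gap between the two notions can be made concrete with a toy example. The sketch below is illustrative only and is not taken from the paper: it assumes a rock-paper-scissors partner that always plays "rock", scores literal theory of mind as prediction accuracy, and scores functional theory of mind as mean regret against the best response. The `Turn` record, both scoring functions, and the choice of regret as the metric are all assumptions made for this example; this excerpt does not state how the paper measures either quantity.

```python
from dataclasses import dataclass

# Hypothetical setup: each key is beaten by its value.
BEATS = {"rock": "paper", "paper": "scissors", "scissors": "rock"}


def payoff(mine: str, theirs: str) -> int:
    """Return +1 for a win, 0 for a tie, -1 for a loss."""
    if mine == theirs:
        return 0
    return 1 if BEATS[theirs] == mine else -1


@dataclass
class Turn:
    predicted: str  # what the agent says the partner will play
    played: str     # what the agent actually plays
    partner: str    # what the partner actually played


def literal_accuracy(turns: list[Turn]) -> float:
    """Fraction of turns on which the agent predicted the partner's move."""
    return sum(t.predicted == t.partner for t in turns) / len(turns)


def mean_regret(turns: list[Turn]) -> float:
    """Average gap between the best-response payoff and the payoff obtained."""
    total = 0
    for t in turns:
        best = payoff(BEATS[t.partner], t.partner)  # payoff of the winning move
        total += best - payoff(t.played, t.partner)
    return total / len(turns)


# An agent that predicts "rock" correctly but still plays a losing move,
# and one whose actions are consistent with its own predictions.
inconsistent = [Turn("rock", "scissors", "rock") for _ in range(10)]
consistent = [Turn("rock", "paper", "rock") for _ in range(10)]

print(literal_accuracy(inconsistent), mean_regret(inconsistent))
print(literal_accuracy(consistent), mean_regret(consistent))
```

Both agents predict the partner perfectly, so a prediction-only metric cannot tell them apart. Only the action-based score separates the agent that adapts from the one that does not, which is the kind of inconsistency the abstract says a literal metric misses.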

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques