Language Models Might Not Understand You: Evaluating Theory of Mind via Story Prompting

Nathaniel Getachew; Abulhair Saparov

arXiv:2506.19089·cs.CL·April 28, 2026

TL;DR

This paper introduces StorySim, a programmable framework that synthetically generates stories for evaluating the theory of mind (ToM) and world modeling (WM) capabilities of large language models. Experiments show that most models score higher on WM tasks than on ToM tasks and show signs of heuristic reasoning.

Contribution

StorySim provides a controllable method for assessing LLMs' mental-state reasoning with newly generated stories rather than ones that may already appear in pretraining data, enabling detailed analysis of their ToM and WM skills.

Findings

01

Models perform better on world modeling than theory of mind tasks.

02

Models reason more accurately when the subject of reasoning is a person rather than an inanimate object.

03

Models show evidence of heuristic reasoning and over-reliance on early story events.

Abstract

We introduce StorySim, a programmable framework for synthetically generating stories to evaluate the theory of mind (ToM) and world modeling (WM) capabilities of large language models (LLMs). Unlike prior benchmarks that may suffer from contamination in pretraining data, or rely on an LLM for generation, StorySim produces novel, compositional story prompts anchored by a highly controllable Storyboard, enabling precise manipulation of character perspectives and events. We use this framework to design first- and second-order ToM tasks alongside WM tasks that control for the ability to track and model mental states. Our experiments across a suite of LLMs show that most models achieve higher accuracy on WM tasks than on ToM tasks, and that models tend to reason more accurately when the subject of reasoning is a person rather than an inanimate object. Additionally, our framework enabled us…
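
To make the idea of a programmatically generated story with paired ToM and WM questions concrete, here is a minimal sketch. It is not StorySim's code or API: the `Event` class, the `simulate` and `render` functions, the sentence templates, and the rule that a character updates their belief only when present for a move are all hypothetical simplifications chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """One step of a hypothetical storyboard (not StorySim's actual schema)."""
    actor: str  # character performing the action
    kind: str   # "enter", "leave", or "move_object"
    place: str  # room entered or left, or the container the object is put in


def simulate(events):
    """Replay events and return (true object location, per-character beliefs).

    Assumption: a character's belief changes only if they are in the room
    when the object is moved.
    """
    present, beliefs, obj_loc = set(), {}, None
    for e in events:
        if e.kind == "enter":
            present.add(e.actor)
        elif e.kind == "leave":
            present.discard(e.actor)
        elif e.kind == "move_object":
            obj_loc = e.place
            for character in present:
                beliefs[character] = e.place
    return obj_loc, beliefs


def render(events, obj="ball"):
    """Turn the event list into a plain-text story using fixed templates."""
    templates = {
        "enter": "{a} enters the {p}.",
        "leave": "{a} leaves the {p}.",
        "move_object": "{a} puts the {o} in the {p}.",
    }
    return " ".join(
        templates[e.kind].format(a=e.actor, p=e.place, o=obj) for e in events
    )


events = [
    Event("Anna", "enter", "room"),
    Event("Bob", "enter", "room"),
    Event("Anna", "move_object", "box"),
    Event("Bob", "leave", "room"),
    Event("Anna", "move_object", "basket"),
]

story = render(events)
true_loc, beliefs = simulate(events)

# WM question: where is the ball?                  answer: true_loc
# First-order ToM question: where does Bob think
# the ball is?                                     answer: beliefs["Bob"]
print(story)
print("WM answer:", true_loc)
print("ToM answer (Bob):", beliefs["Bob"])
```

Because the ground-truth answers are computed from the same event list that produces the text, a WM question and a ToM question can be asked about an identical story, and the two answers diverge exactly when a character misses an event.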
