TL;DR
This paper introduces StorySim, a flexible framework for generating stories to evaluate large language models' theory of mind and world modeling capabilities, revealing their strengths and heuristic tendencies.
Contribution
StorySim provides a novel, controllable method for assessing LLMs' mental state reasoning without data contamination, enabling detailed analysis of their ToM and WM skills.
Findings
Most models achieve higher accuracy on world modeling tasks than on theory of mind tasks.
Models reason more accurately when the subject of reasoning is a person rather than an inanimate object.
Models show evidence of heuristic reasoning and over-reliance on early story events.
Abstract
We introduce StorySim, a programmable framework for synthetically generating stories to evaluate the theory of mind (ToM) and world modeling (WM) capabilities of large language models (LLMs). Unlike prior benchmarks that may suffer from contamination in pretraining data, or rely on an LLM for generation, StorySim produces novel, compositional story prompts anchored by a highly controllable Storyboard, enabling precise manipulation of character perspectives and events. We use this framework to design first- and second-order ToM tasks alongside WM tasks that control for the ability to track and model mental states. Our experiments across a suite of LLMs show that most models achieve higher accuracy on WM tasks than on ToM tasks, and that models tend to reason more accurately when the subject of reasoning is a person rather than an inanimate object. Additionally, our framework enabled us…
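To make the idea of a storyboard-driven prompt concrete, here is a minimal Python sketch. It is not StorySim's actual API: the `Event` class and the functions `render_story`, `true_location`, and `believed_location` are hypothetical names invented for illustration, and the story follows the classic unexpected-transfer pattern rather than anything taken from the paper. It assumes a character's belief about an object is simply the last placement that character performed or witnessed.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Event:
    """One hypothetical storyboard step: an actor places an object somewhere."""
    actor: str
    obj: str
    location: str
    witnesses: set[str] = field(default_factory=set)  # observers other than the actor


def render_story(events: list[Event]) -> str:
    """Turn the storyboard into plain story text, one sentence per event."""
    sentences = []
    for e in events:
        others = sorted(e.witnesses - {e.actor})
        if others:
            tail = f" while {' and '.join(others)} watched"
        else:
            tail = " while nobody else was present"
        sentences.append(f"{e.actor} put the {e.obj} in the {e.location}{tail}.")
    return " ".join(sentences)


def true_location(events: list[Event], obj: str) -> str | None:
    """World modeling answer: where the object really is after every event."""
    loc = None
    for e in events:
        if e.obj == obj:
            loc = e.location
    return loc


def believed_location(events: list[Event], obj: str, character: str) -> str | None:
    """First-order ToM answer: the last placement the character did or saw."""
    loc = None
    for e in events:
        if e.obj == obj and (character == e.actor or character in e.witnesses):
            loc = e.location
    return loc


# Illustrative storyboard: Anne moves the marble while Sally is absent.
storyboard = [
    Event("Sally", "marble", "basket", witnesses={"Anne"}),
    Event("Anne", "marble", "box"),
]

if __name__ == "__main__":
    print(render_story(storyboard))
    print("WM  answer:", true_location(storyboard, "marble"))
    print("ToM answer (Sally):", believed_location(storyboard, "marble", "Sally"))
```

Because both answers are computed from the same event list, changing who witnesses which event changes the ToM answer while leaving the WM answer fixed, which is the kind of controlled contrast the abstract describes.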
