Evaluating Large Language Models in Theory of Mind Tasks
Michal Kosinski

TL;DR
This study evaluates eleven large language models on a battery of 40 false-belief Theory of Mind tasks and finds a sharp improvement in recent models: ChatGPT-4 solved 75% of the tasks, while smaller and older models solved none.
Contribution
It introduces a custom-made battery of false-belief tasks for LLMs and shows that the most advanced model tested, ChatGPT-4, matches the performance of six-year-old children observed in past studies.
Findings
GPT-3-davinci-003 and ChatGPT-3.5-turbo solved 20% of tasks
ChatGPT-4 solved 75% of tasks
Smaller and older models solved no tasks
Abstract
Eleven Large Language Models (LLMs) were assessed using a custom-made battery of false-belief tasks, considered a gold standard in testing Theory of Mind (ToM) in humans. The battery included 640 prompts spread across 40 diverse tasks, each one including a false-belief scenario, three closely matched true-belief control scenarios, and the reversed versions of all four. To solve a single task, a model needed to correctly answer 16 prompts across all eight scenarios. Smaller and older models solved no tasks; GPT-3-davinci-003 (from November 2022) and ChatGPT-3.5-turbo (from March 2023) solved 20% of the tasks; ChatGPT-4 (from June 2023) solved 75% of the tasks, matching the performance of six-year-old children observed in past studies. We explore the potential interpretation of these findings, including the intriguing possibility that ToM, previously considered exclusive to humans, may…
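The scoring rule described in the abstract can be sketched in a few lines. This is a minimal illustration, not the author's evaluation code: the names `task_solved`, `percent_solved`, and the example data are hypothetical, and the figure of two prompts per scenario is inferred from the stated 16 prompts across eight scenarios.

```python
from typing import Dict, List

# Figures taken from the abstract.
N_TASKS = 40
SCENARIOS_PER_TASK = 8  # 1 false-belief + 3 true-belief controls, plus the reversed version of each

# Assumption: inferred from "16 prompts across all eight scenarios".
PROMPTS_PER_SCENARIO = 2
PROMPTS_PER_TASK = SCENARIOS_PER_TASK * PROMPTS_PER_SCENARIO  # 16


def task_solved(prompt_results: List[bool]) -> bool:
    """A task counts as solved only if every one of its prompts is answered correctly."""
    if len(prompt_results) != PROMPTS_PER_TASK:
        raise ValueError(f"expected {PROMPTS_PER_TASK} prompt results, got {len(prompt_results)}")
    return all(prompt_results)


def percent_solved(results: Dict[str, List[bool]]) -> float:
    """Percentage of tasks in which all prompts were answered correctly."""
    solved = sum(task_solved(r) for r in results.values())
    return 100.0 * solved / len(results)


# Hypothetical data: 30 tasks fully correct, 10 tasks with a single wrong prompt.
example = {f"task_{i}": [True] * PROMPTS_PER_TASK for i in range(30)}
example.update({f"task_{i}": [True] * 15 + [False] for i in range(30, N_TASKS)})

print(N_TASKS * PROMPTS_PER_TASK)  # total prompts in the battery
print(percent_solved(example))     # share of tasks solved under the all-or-nothing rule
```

Under this all-or-nothing rule, a single wrong answer among a task's 16 prompts fails the whole task, so the percentage of tasks solved can be far lower than the percentage of individual prompts answered correctly.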
Taxonomy
Topics: Topic Modeling · Robotics and Automated Systems
Methods: Attention Is All You Need · Test · Cosine Annealing · Softmax · Linear Warmup With Cosine Annealing · Residual Connection · Weight Decay
