Evaluating Large Language Models in Theory of Mind Tasks

Michal Kosinski

arXiv:2302.02083 · cs.CL · November 6, 2024 · 132 cites

TL;DR

This study evaluates eleven large language models on a battery of 40 false-belief Theory of Mind tasks and finds that performance rises sharply in recent models: ChatGPT-4 solved 75% of the tasks, while smaller and older models solved none.

Contribution

It introduces a novel battery of false-belief tasks for LLMs and shows that the most advanced model tested, ChatGPT-4, matches the performance of six-year-old children observed in past studies.

Findings

01. GPT-3-davinci-003 and ChatGPT-3.5-turbo solved 20% of the tasks

02. ChatGPT-4 solved 75% of the tasks

03. Smaller and older models solved no tasks

Abstract

Eleven Large Language Models (LLMs) were assessed using a custom-made battery of false-belief tasks, considered a gold standard in testing Theory of Mind (ToM) in humans. The battery included 640 prompts spread across 40 diverse tasks, each one including a false-belief scenario, three closely matched true-belief control scenarios, and the reversed versions of all four. To solve a single task, a model needed to correctly answer 16 prompts across all eight scenarios. Smaller and older models solved no tasks; GPT-3-davinci-003 (from November 2022) and ChatGPT-3.5-turbo (from March 2023) solved 20% of the tasks; ChatGPT-4 (from June 2023) solved 75% of the tasks, matching the performance of six-year-old children observed in past studies. We explore the potential interpretation of these findings, including the intriguing possibility that ToM, previously considered exclusive to humans, may…
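The scoring rule described in the abstract is all-or-nothing: 640 prompts over 40 tasks gives 16 prompts per task, and a task counts as solved only when all 16 are answered correctly. The sketch below illustrates how that rule differs from plain per-prompt accuracy. The function names and the example data are hypothetical and are not taken from the paper; the numbers are chosen only to show how the arithmetic works.

```python
from typing import Dict, List

# Figures stated in the abstract.
TASKS = 40
PROMPTS_PER_TASK = 16  # 640 prompts / 40 tasks, spread across 8 scenarios


def task_solved(prompt_results: List[bool]) -> bool:
    """A task is solved only if every one of its prompts is answered correctly."""
    if len(prompt_results) != PROMPTS_PER_TASK:
        raise ValueError(f"expected {PROMPTS_PER_TASK} prompt results per task")
    return all(prompt_results)


def fraction_solved(results_by_task: Dict[str, List[bool]]) -> float:
    """Share of tasks solved under the all-or-nothing rule."""
    if not results_by_task:
        raise ValueError("no tasks to score")
    solved = sum(task_solved(r) for r in results_by_task.values())
    return solved / len(results_by_task)


def prompt_accuracy(results_by_task: Dict[str, List[bool]]) -> float:
    """Plain per-prompt accuracy, shown for contrast with the task-level score."""
    flat = [ok for r in results_by_task.values() for ok in r]
    return sum(flat) / len(flat)


# Hypothetical data (NOT the paper's results): 30 tasks fully correct,
# 10 tasks with exactly one wrong prompt each.
example = {f"task_{i}": [True] * PROMPTS_PER_TASK for i in range(30)}
example.update(
    {f"task_{i}": [True] * (PROMPTS_PER_TASK - 1) + [False] for i in range(30, TASKS)}
)

task_score = fraction_solved(example)      # 30 / 40
prompt_score = prompt_accuracy(example)    # 630 / 640
print(f"tasks solved: {task_score:.2%}, prompts correct: {prompt_score:.2%}")
```

In this made-up example a single wrong prompt in each of ten tasks drops the task-level score to 30 of 40 tasks while 630 of 640 prompts are still correct, which is why the task-level percentages reported in the abstract are a much stricter measure than per-prompt accuracy.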

Taxonomy

Topics: Topic Modeling · Robotics and Automated Systems

Methods: Attention Is All You Need · Cosine Annealing · Softmax · Linear Warmup With Cosine Annealing · Residual Connection · Weight Decay