ToMBench: Benchmarking Theory of Mind in Large Language Models

Zhuang Chen; Jincenzi Wu; Jinfeng Zhou; Bosi Wen; Guanqun Bi; Gongyao Jiang; Yaru Cao; Mengting Hu; Yunghwei Lai; Zexuan Xiong; Minlie Huang

arXiv:2402.15052 · cs.CL · December 10, 2024 · 2 cites


TL;DR

This paper introduces ToMBench, a comprehensive, unbiased benchmark for evaluating Theory of Mind in large language models, revealing that current models still significantly lag behind human social cognition capabilities.

Contribution

The paper presents ToMBench, a systematic, multi-faceted evaluation framework for ToM in LLMs, addressing previous assessment limitations and enabling consistent, unbiased comparisons.

Findings

01. Even the most advanced LLMs, such as GPT-4, lag behind human performance on ToM tasks by more than 10 percentage points.

02. ToMBench covers 8 tasks and 31 abilities in social cognition.

03. Current LLMs have not yet achieved human-level Theory of Mind.

Abstract

Theory of Mind (ToM) is the cognitive capability to perceive and ascribe mental states to oneself and others. Recent research has sparked a debate over whether large language models (LLMs) exhibit a form of ToM. However, existing ToM evaluations are hindered by challenges such as constrained scope, subjective judgment, and unintended contamination, yielding inadequate assessments. To address this gap, we introduce ToMBench with three key characteristics: a systematic evaluation framework encompassing 8 tasks and 31 abilities in social cognition, a multiple-choice question format to support automated and unbiased evaluation, and a build-from-scratch bilingual inventory to strictly avoid data leakage. Based on ToMBench, we conduct extensive experiments to evaluate the ToM performance of 10 popular LLMs across tasks and abilities. We find that even the most advanced LLMs like GPT-4 lag…
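Because the benchmark uses a multiple-choice format, scoring can be automated by comparing each predicted option letter with the gold answer. The following is a minimal sketch of that idea, not the authors' evaluation code: the record fields (`task`, `answer`, `prediction`), the task labels, and the sample data are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def accuracy_by_task(records):
    """Return per-task accuracy for multiple-choice predictions.

    Each record is a dict with hypothetical keys:
      'task'       - label of the task the question belongs to
      'answer'     - gold option letter, e.g. 'A'
      'prediction' - option letter chosen by the model
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for record in records:
        task = record["task"]
        total[task] += 1
        # Compare option letters case-insensitively, ignoring stray whitespace.
        gold = record["answer"].strip().upper()
        predicted = record["prediction"].strip().upper()
        if predicted == gold:
            correct[task] += 1
    return {task: correct[task] / total[task] for task in total}


# Illustrative data only; these are not items or results from ToMBench.
records = [
    {"task": "task_a", "answer": "A", "prediction": "A"},
    {"task": "task_a", "answer": "B", "prediction": "C"},
    {"task": "task_b", "answer": "D", "prediction": " d "},
]

scores = accuracy_by_task(records)
for task, score in sorted(scores.items()):
    print(f"{task}: {score:.2f}")
```

Exact matching on option letters is what removes the subjective judgment that free-form answers would require; the same per-group aggregation could be keyed on ability instead of task.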

Code & Models

Repositories

zhchen18/tombench (Official)

Videos

ToMBench: Benchmarking Theory of Mind in Large Language Models (Underline)
