Theory of Mind in Large Language Models: Examining Performance of 11 State-of-the-Art models vs. Children Aged 7-10 on Advanced Tests

Max J. van Duijn; Bram M.A. van Dijk; Tom Kouwenhoven; Werner de Valk; Marco R. Spruit; and Peter van der Putten

arXiv:2310.20320·cs.CL·November 1, 2023·2 cites

Open Access

TL;DR

This study evaluates the Theory of Mind (ToM) capabilities of 11 state-of-the-art large language models on advanced ToM tests and compares their performance with that of children aged 7-10. It finds that instruction-tuned models outperform base models and sometimes even children.

Contribution

The paper introduces a comprehensive assessment of ToM in LLMs that goes beyond false-belief tests and includes robustness checks and a comparison with children's performance.

Findings

01. Instruction-tuned LLMs from the GPT family outperform other models, and often also children.

02. Base LLMs are mostly unable to solve ToM tasks, even with specialized prompting.

03. Instruction tuning enhances LLMs' ability to handle complex social reasoning.

Abstract

To what degree should we ascribe cognitive capacities to Large Language Models (LLMs), such as the ability to reason about intentions and beliefs known as Theory of Mind (ToM)? Here we add to this emerging debate by (i) testing 11 base- and instruction-tuned LLMs on capabilities relevant to ToM beyond the dominant false-belief paradigm, including non-literal language usage and recursive intentionality; (ii) using newly rewritten versions of standardized tests to gauge LLMs' robustness; (iii) prompting and scoring for open besides closed questions; and (iv) benchmarking LLM performance against that of children aged 7-10 on the same tasks. We find that instruction-tuned LLMs from the GPT family outperform other models, and often also children. Base-LLMs are mostly unable to solve ToM tasks, even with specialized prompting. We suggest that the interlinked evolution and development of…

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Domain Adaptation and Few-Shot Learning

Methods: Multi-Head Attention · Attention Is All You Need · Cosine Annealing · Linear Layer · Weight Decay · Softmax · Adam · Attention Dropout · Discriminative Fine-Tuning · Linear Warmup With Cosine Annealing