Understanding Social Reasoning in Language Models with Language Models
Kanishk Gandhi, Jan-Philipp Fränken, Tobias Gerstenberg, Noah D. Goodman

TL;DR
This paper introduces BigToM, a new benchmark for assessing social reasoning and Theory-of-Mind in large language models, revealing GPT-4's human-like inference abilities and highlighting limitations in other models.
Contribution
The paper presents a novel framework for generating social reasoning evaluations and creates BigToM, a comprehensive benchmark for testing LLMs' ToM capabilities.
Findings
GPT-4 exhibits ToM capabilities similar to humans.
Other LLMs show limited social reasoning skills.
Human ratings favor the new benchmark over previous evaluations.
Abstract
As Large Language Models (LLMs) become increasingly integrated into our everyday lives, understanding their ability to comprehend human mental states becomes critical for ensuring effective interactions. However, despite the recent attempts to assess the Theory-of-Mind (ToM) reasoning capabilities of LLMs, the degree to which these models can align with human ToM remains a nuanced topic of exploration. This is primarily due to two distinct challenges: (1) the presence of inconsistent results from previous evaluations, and (2) concerns surrounding the validity of existing evaluation methodologies. To address these challenges, we present a novel framework for procedurally generating evaluations with LLMs by populating causal templates. Using our framework, we create a new social reasoning benchmark (BigToM) for LLMs which consists of 25 controls and 5,000 model-written evaluations. We…
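The template-population idea in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the slot names (`context`, `belief`, `causal_event`, `percept`), the `CausalTemplate` class, and the `build_story` helper are hypothetical and are not taken from the paper. In the described framework an LLM would write the slot values; here they are hard-coded so the example runs on its own.

```python
from dataclasses import dataclass


@dataclass
class CausalTemplate:
    # Hypothetical slots; the abstract does not list the actual template fields.
    context: str        # who the agent is and what they are doing
    belief: str         # the agent's initial belief about the world
    causal_event: str   # an event that changes the state of the world
    percept: str        # the agent observing that event


def build_story(template: CausalTemplate, agent_sees_event: bool) -> str:
    """Assemble one story variant from a populated template.

    Including the percept yields a true-belief condition; omitting it
    yields a false-belief condition built from the same slots.
    """
    parts = [template.context, template.belief, template.causal_event]
    if agent_sees_event:
        parts.append(template.percept)
    return " ".join(parts)


# Slot values an LLM might produce (hard-coded here as an assumption).
template = CausalTemplate(
    context="Noor is making a latte in a cafe.",
    belief="Noor believes the pitcher contains oat milk.",
    causal_event="A coworker swaps the oat milk for almond milk.",
    percept="Noor sees the coworker make the swap.",
)

true_belief_story = build_story(template, agent_sees_event=True)
false_belief_story = build_story(template, agent_sees_event=False)

print(true_belief_story)
print(false_belief_story)
```

Because both variants come from one populated template, they differ only in whether the agent perceives the event, which is what would let such a design compare matched conditions.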
Taxonomy
Topics: Topic Modeling · Explainable Artificial Intelligence (XAI) · Computational and Text Analysis Methods
Methods: ALIGN
