How FaR Are Large Language Models From Agents with Theory-of-Mind?

Pei Zhou; Aman Madaan; Srividya Pranavi Potharaju; Aditya Gupta; Kevin R. McKee; Ari Holtzman; Jay Pujara; Xiang Ren; Swaroop Mishra; Aida Nematzadeh; Shyam Upadhyay; Manaal Faruqui

arXiv:2310.03051 · cs.CL · October 6, 2023 · 6 cites

TL;DR

This paper introduces a new evaluation paradigm called Thinking for Doing (T4D) to test if large language models can use Theory-of-Mind in social scenarios, revealing their struggles and proposing a prompting framework to improve their reasoning and action selection.

Contribution

The paper presents T4D as a novel benchmark for assessing LLMs' ability to translate ToM inferences into actions and introduces the FaR prompting method to significantly enhance model performance.
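As a rough illustration of what a FaR-style prompt could look like, the sketch below assembles a two-stage "foresee, then reflect" prompt for a T4D-style multiple-choice question. The prompt wording, the `T4DExample` class, the `build_far_prompt` helper, and the sample story are all hypothetical and written for illustration; they are not the prompts or data format used in the paper.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class T4DExample:
    """Hypothetical container for one T4D-style item (not the paper's schema)."""
    story: str
    question: str
    choices: List[str]


def build_far_prompt(example: T4DExample) -> str:
    """Build a zero-shot prompt that asks the model to foresee, then reflect.

    The instruction text is an assumption made for this sketch.
    """
    # Label the answer options A, B, C, ...
    options = "\n".join(
        f"{chr(ord('A') + i)}. {choice}" for i, choice in enumerate(example.choices)
    )
    return (
        f"Story:\n{example.story}\n\n"
        f"Question: {example.question}\n{options}\n\n"
        "Foresee: For each character, describe what they believe, what they are "
        "likely to do next, and what challenge they may run into.\n"
        "Reflect: For each option, explain whether acting on it would help with "
        "the challenges you identified.\n"
        "Answer: Reply with the letter of the best option."
    )


# Illustrative Sally-Anne-style item, invented for this sketch.
example = T4DExample(
    story=(
        "Sally puts the marble in the basket and leaves the room. "
        "Anne moves the marble to the box. Sally plans to use the marble later."
    ),
    question="Who would benefit most from receiving helpful information?",
    choices=["Sally", "Anne", "Neither"],
)
prompt = build_far_prompt(example)
print(prompt)
```

The resulting string would be sent to an LLM as a single zero-shot query; keeping the foresee and reflect instructions as separate labeled steps is one simple way to impose a reasoning structure before the model commits to an action.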

Findings

1. GPT-4's performance improves from 50% to 71% with FaR.
2. LLMs excel at belief tracking but struggle with action guidance.
3. FaR outperforms Chain-of-Thought and Self-Ask methods.

Abstract

"Thinking is for Doing." Humans can infer other people's mental states from observations--an ability called Theory-of-Mind (ToM)--and subsequently act pragmatically on those inferences. Existing question answering benchmarks such as ToMi ask models questions to make inferences about beliefs of characters in a story, but do not test whether models can then use these inferences to guide their actions. We propose a new evaluation paradigm for large language models (LLMs): Thinking for Doing (T4D), which requires models to connect inferences about others' mental states to actions in social scenarios. Experiments on T4D demonstrate that LLMs such as GPT-4 and PaLM 2 seemingly excel at tracking characters' beliefs in stories, but they struggle to translate this capability into strategic action. Our analysis reveals the core challenge for LLMs lies in identifying the implicit inferences about…

Peer Reviews

Decision · Submitted to ICLR 2024

Reviewer 01 · Rating 6 (marginally above the acceptance threshold) · Confidence 3

Strengths

1. The new prompting framework that encourages LLMs to anticipate future challenges and reason about potential actions is novel and impressive.
2. The paper is well-organized and well-written with clear motivations, detailed discussion, nice figures, and sufficient comparison experiments, making it easy to follow and understand.
3. This work performs comprehensive experiments over benchmark data to show the effectiveness of FaR in several in-context settings.

Weaknesses

1. This work illustrates the differences between FaR and GPT-4 in real-world scenarios. I am also curious about what it will look like with different prompting methods, such as Chain-of-Thought and Tree-of-Thought. This work does not introduce existing prompting methods in detail, which will confuse the audience to some degree.
2. FaR breaks the T4D process into three steps: question decomposition, theory-of-mind inference, and common sense assumption. However, the motivations for the breakdown ha

Reviewer 02 · Rating 3 (reject, not good enough) · Confidence 5

Strengths

- I found the paper to be very clear, with sections and methods well motivated.
- The paper collects representative data from human participants to measure performance.
- The proposed method improves performance on the given task.

Weaknesses

The paper’s contributions can be split into two parts: A benchmark for testing theory of mind, and a prompting method to achieve. In terms of research, the paper proposes a new problem and a new method. I talk about their weaknesses separately.

Weakness: T4D
- The formal reasoning steps and inferences required for T4D are not clearly specified. For a model to succeed at T4D, it likely needs to make inferences about agents' utilities, beliefs, perceptual access and how helping might alter b

Reviewer 03 · Rating 6 (marginally above the acceptance threshold) · Confidence 4

Strengths

This paper is well-written and easy to understand. The authors show convincingly that their prompting method is better than baselines; they use a set of baselines and show it works on multiple datasets. The analysis by adding information to the prompt to find when models start to perform better on the action selection is insightful. The ablations they do throughout the work are helpful for understanding the importance of all parts of the prompting method. The results are interesting; LLMs c

Weaknesses

The authors state that the main contribution of this work is evaluating the ability of LLMs to take actions based on ToM inferences, but they do not mention a paper that does the same: "Understanding Social Reasoning in Language Models with Language Models", Gandhi et al. 2023. That work also tests the ability of LLMs to take appropriate actions in Sally-and-Anne-type scenarios, evaluates humans on the same task, and finds similar results (models struggle more with the actions than with the inferences).

Code & Models

Repositories

sachith-gunasekara/t4d
none

Datasets

sachithgunasekara/t4d
dataset · 18 dl

Taxonomy

Topics: Topic Modeling · Computational and Text Analysis Methods · Natural Language Processing Techniques

Methods: Multi-Head Attention · Attention Is All You Need · Dropout · Dense Connections · Linear Layer · Label Smoothing · Adam · Absolute Position Encodings · Residual Connection · Layer Normalization