XToM: Exploring the Multilingual Theory of Mind for Large Language Models

Chunkit Chan; Yauwai Yim; Hongchuan Zeng; Zhiying Zou; Xinyuan Cheng; Zhifan Sun; Zheye Deng; Kawai Chung; Yuzhuo Ao; Yixiang Fan; Cheng Jiayang; Ercong Nie; Ginny Y. Wong; Helmut Schmid; Hinrich Schütze; Simon See; Yangqiu Song

arXiv:2506.02461 · cs.CL · June 4, 2025

TL;DR

This paper introduces XToM, a multilingual benchmark for evaluating Theory of Mind in large language models across five languages, revealing significant variability and limitations in models' ability to reason about mental states across diverse linguistic contexts.

Contribution

The paper presents XToM, the first validated multilingual ToM benchmark, enabling systematic evaluation of LLMs' mentalizing abilities across multiple languages and scenarios.

Findings

1. LLMs show high proficiency in multilingual understanding but inconsistent ToM performance.

2. Models exhibit limitations in replicating human-like mentalizing across languages.

3. Performance varies significantly across different languages and contexts.

Abstract

Theory of Mind (ToM), the ability to infer mental states in others, is pivotal for human social cognition. Existing evaluations of ToM in LLMs are largely limited to English, neglecting the linguistic diversity that shapes human cognition. This limitation raises a critical question: can LLMs exhibit Multilingual Theory of Mind, which is the capacity to reason about mental states across diverse linguistic contexts? To address this gap, we present XToM, a rigorously validated multilingual benchmark that evaluates ToM across five languages and incorporates diverse, contextually rich task scenarios. Using XToM, we systematically evaluate LLMs (e.g., DeepSeek R1), revealing a pronounced dissonance: while models excel in multilingual language understanding, their ToM performance varies across languages. Our findings expose limitations in LLMs' ability to replicate human-like mentalizing…

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques