SoMi-ToM: Evaluating Multi-Perspective Theory of Mind in Embodied Social Interactions

Xianzhe Fan; Xuhui Zhou; Chuanyang Jin; Kolby Nottingham; Hao Zhu; Maarten Sap

arXiv:2506.23046 · cs.CL · December 16, 2025

TL;DR

This paper introduces the SoMi-ToM benchmark to evaluate multi-perspective Theory of Mind in embodied social interactions, highlighting the gap between current models and human performance in complex, multimodal scenarios.

Contribution

The paper presents a novel multimodal, multi-perspective ToM benchmark based on rich interaction data, enabling comprehensive evaluation of models in embodied social contexts.

Findings

01 LVLMs perform significantly worse than humans on ToM tasks

02 The average accuracy gap is 40.1% in first-person evaluation

03 Models need to improve ToM capabilities in complex social interactions
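
As a rough illustration of how an average accuracy gap like the one above could be computed, here is a minimal Python sketch. It assumes each question has a single gold answer and reads the gap as human accuracy minus the mean model accuracy, in percentage points; both are assumptions, not details confirmed by this page. The function names, model names, and all answer data are hypothetical placeholders, not results from the paper.

```python
from statistics import mean


def accuracy(predictions, answers):
    """Fraction of questions whose prediction matches the gold answer."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)


def average_gap(human_acc, model_accs):
    """Human accuracy minus mean model accuracy, in percentage points."""
    return 100 * (human_acc - mean(model_accs))


# Hypothetical placeholder data (NOT taken from the benchmark).
gold = ["A", "C", "B", "D"]
human_answers = ["A", "C", "B", "D"]
model_answers = {
    "model_x": ["A", "C", "A", "B"],  # hypothetical model
    "model_y": ["A", "B", "B", "B"],  # hypothetical model
}

human_acc = accuracy(human_answers, gold)
model_accs = [accuracy(preds, gold) for preds in model_answers.values()]
gap = average_gap(human_acc, model_accs)
print(f"average gap: {gap:.1f} percentage points")
```

Averaging per-model accuracies before subtracting weights every model equally; pooling all model answers into one accuracy would give the same figure only when every model answers the same number of questions.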

Abstract

Humans continuously infer the states, goals, and behaviors of others by perceiving their surroundings in dynamic, real-world social interactions. However, most Theory of Mind (ToM) benchmarks evaluate only static, text-based scenarios, which differ substantially from real interactions. We propose the SoMi-ToM benchmark, designed to evaluate multi-perspective ToM in complex, embodied, multi-agent social interactions. This benchmark is based on rich multimodal interaction data generated by the interaction environment SoMi, covering diverse crafting goals and social relationships. Our framework supports multi-level evaluation: (1) first-person evaluation provides multimodal (visual, dialogue, action, etc.) input from a first-person perspective during a task for real-time state inference, (2) third-person evaluation provides complete third-person perspective video and text records…

Taxonomy

Topics: Multimodal Machine Learning Applications · Action Observation and Synchronization · Social Robot Interaction and HRI