SoMi-ToM: Evaluating Multi-Perspective Theory of Mind in Embodied Social Interactions
Xianzhe Fan, Xuhui Zhou, Chuanyang Jin, Kolby Nottingham, Hao Zhu, Maarten Sap

TL;DR
This paper introduces the SoMi-ToM benchmark to evaluate multi-perspective Theory of Mind in embodied social interactions, highlighting the gap between current models and human performance in complex, multimodal scenarios.
Contribution
The paper presents a novel multimodal, multi-perspective ToM benchmark based on rich interaction data, enabling comprehensive evaluation of models in embodied social contexts.
Findings
Large vision-language models (LVLMs) perform significantly worse than humans on ToM tasks
The average accuracy gap between humans and LVLMs is 40.1% in first-person evaluation
Models need stronger ToM capabilities for complex social interactions
Abstract
Humans continuously infer the states, goals, and behaviors of others by perceiving their surroundings in dynamic, real-world social interactions. However, most Theory of Mind (ToM) benchmarks only evaluate static, text-based scenarios, which have a significant gap compared to real interactions. We propose the SoMi-ToM benchmark, designed to evaluate multi-perspective ToM in embodied multi-agent complex social interactions. This benchmark is based on rich multimodal interaction data generated by the interaction environment SoMi, covering diverse crafting goals and social relationships. Our framework supports multi-level evaluation: (1) first-person evaluation provides multimodal (visual, dialogue, action, etc.) input from a first-person perspective during a task for real-time state inference, (2) third-person evaluation provides complete third-person perspective video and text records…
Taxonomy
Topics: Multimodal Machine Learning Applications · Action Observation and Synchronization · Social Robot Interaction and HRI
