MuMA-ToM: Multi-modal Multi-Agent Theory of Mind

Haojun Shi; Suyu Ye; Xinyu Fang; Chuanyang Jin; Leyla Isik; Yen-Ling Kuo; Tianmin Shu

arXiv:2408.12574 · cs.AI · January 24, 2025



TL;DR

MuMA-ToM introduces the first multi-modal benchmark for Theory of Mind reasoning in embodied multi-agent interactions. It tests whether AI systems can infer people's mental states, and their inferences about each other's mental states, from multi-modal information about their interactions.

Contribution

The paper presents MuMA-ToM, the first multi-modal ToM benchmark for embodied multi-agent interactions, and proposes LIMP, a novel multi-modal ToM model that outperforms existing methods.

Findings

01. LIMP significantly outperforms state-of-the-art models on MuMA-ToM.

02. MuMA-ToM is validated with human experiments and baseline comparisons.

03. The benchmark enables evaluation of mental-state reasoning in realistic multi-modal scenarios.

Abstract

Understanding people's social interactions in complex real-world scenarios often relies on intricate mental reasoning. To truly understand how and why people interact with one another, we must infer the underlying mental states that give rise to the social interactions, i.e., Theory of Mind reasoning in multi-agent interactions. Additionally, social interactions are often multi-modal: we can watch people's actions, hear their conversations, and/or read about their past behaviors. For AI systems to successfully and safely interact with people in real-world environments, they also need to understand people's mental states as well as their inferences about each other's mental states based on multi-modal information about their interactions. For this, we introduce MuMA-ToM, a Multi-modal Multi-Agent Theory of Mind benchmark. MuMA-ToM is the first multi-modal Theory of Mind benchmark that…


Code & Models


Datasets

SCAI-JHU/MUMA-TOM-BENCHMARK
dataset · 1.0k downloads

Videos

MuMA-ToM: Multi-modal Multi-Agent Theory of Mind · Underline

Taxonomy

Topics: Speech and dialogue systems · Multi-Agent Systems and Negotiation · Language and cultural evolution