MOMENTS: A Comprehensive Multimodal Benchmark for Theory of Mind
Emilio Villa-Cueva, S M Masrur Ahmed, Rendi Chevi, Jan Christian Blaise Cruz, Kareem Elzeky, Fermin Cristobal, Alham Fikri Aji, Skyler Wang, Rada Mihalcea, Thamar Solorio

TL;DR
MoMentS is a comprehensive multimodal benchmark with over 2,300 questions designed to evaluate the Theory of Mind capabilities of multimodal large language models in realistic social scenarios, revealing current limitations in multimodal integration.
Contribution
This paper introduces MoMentS, a new benchmark for assessing the Theory of Mind (ToM) capabilities of multimodal large language models using realistic video-based social scenarios.
Findings
Visual input generally improves model performance, but models struggle to integrate it effectively.
Models that process dialogue as audio do not consistently outperform transcript-based inputs.
Models still struggle with effective multimodal integration in social understanding.
Abstract
Understanding Theory of Mind (ToM) is essential for building socially intelligent multimodal agents capable of perceiving and interpreting human behavior. We introduce MoMentS (Multimodal Mental States), a comprehensive benchmark designed to assess the ToM capabilities of multimodal large language models (MLLMs) through realistic, narrative-rich scenarios presented in short films. MoMentS includes over 2,300 multiple-choice questions spanning seven distinct ToM categories. The benchmark features long video context windows and realistic social interactions that provide deeper insight into characters' mental states. We evaluate several MLLMs and find that although vision generally improves performance, models still struggle to integrate it effectively. For audio, models that process dialogues as audio do not consistently outperform transcript-based inputs. Our findings highlight the need to…
Taxonomy
Topics: Language, Metaphor, and Cognition · Cognitive Science and Education Research · Categorization, perception, and language
