Through the Theory of Mind's Eye: Reading Minds with Multimodal Video Large Language Models

Zhawnen Chen; Tianchun Wang; Yizhou Wang; Michal Kosinski; Xiang Zhang; Yun Fu; Sheng Li

arXiv:2406.13763 · cs.CV · September 16, 2025

TL;DR

This paper investigates the ability of large multimodal models to perform human-like emotional and social reasoning in videos, developing a pipeline that combines video and text for explicit theory-of-mind reasoning.

Contribution

It introduces a novel multimodal LLM pipeline that incorporates video and text to perform explicit theory-of-mind reasoning on dynamic scenes.

Findings

01. Multimodal LLMs can reason about mental states in videos.

02. Retrieving key frames enhances ToM reasoning interpretability.

03. The approach demonstrates emergent ToM capabilities in multimodal models.

Abstract

Can large multimodal models have a human-like ability for emotional and social reasoning, and if so, how does it work? Recent research has discovered emergent theory-of-mind (ToM) reasoning capabilities in large language models (LLMs). LLMs can reason about people's mental states by solving various text-based ToM tasks that ask questions about the actors' ToM (e.g., human belief, desire, intention). However, human reasoning in the wild is often grounded in dynamic scenes that unfold over time. Thus, we consider videos a new medium for examining spatio-temporal ToM reasoning ability. Specifically, we ask explicit probing questions about videos with abundant social and emotional reasoning content. We develop a multimodal LLM pipeline for ToM reasoning using video and text. We also enable explicit ToM reasoning by retrieving key frames for answering a ToM question, which reveals how multimodal…
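The abstract mentions retrieving key frames to answer a ToM question but, as excerpted here, does not say how the retrieval is done. The sketch below is only an illustration of one common approach, under the assumption that the question and each video frame have already been embedded as vectors in a shared space; it ranks frames by cosine similarity to the question and keeps the top k. The function names, the toy embeddings, and the choice of cosine similarity are all hypothetical and are not taken from the paper.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve_key_frames(
    question_vec: Sequence[float],
    frame_vecs: Sequence[Sequence[float]],
    k: int = 2,
) -> List[int]:
    """Return indices of the k frames most similar to the question.

    Hypothetical helper: indices are returned in temporal order so the
    selected frames can be shown to a model in the order they occur.
    """
    ranked = sorted(
        range(len(frame_vecs)),
        key=lambda i: cosine(question_vec, frame_vecs[i]),
        reverse=True,
    )
    return sorted(ranked[:k])


# Toy embeddings (made up for illustration; real ones would come from an encoder).
question = [1.0, 0.0, 1.0]
frames = [
    [0.0, 1.0, 0.0],  # frame 0: unrelated to the question
    [1.0, 0.0, 0.9],  # frame 1: closely related
    [0.2, 0.9, 0.1],  # frame 2: weakly related
    [0.8, 0.1, 1.0],  # frame 3: closely related
]

key_frames = retrieve_key_frames(question, frames, k=2)
print(key_frames)  # [1, 3]
```

Returning frame indices rather than the frames themselves keeps the selection inspectable: a reader can check which moments of the video a given answer was grounded in, which is the interpretability benefit the findings point to.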

Taxonomy

Topics: Educational Tools and Methods · Digital Storytelling and Education · Speech and dialogue systems