Video-Only ToM: Enhancing Theory of Mind in Multimodal Large Language Models

Siqi Liu; Xinyang Li; Bochao Zou; Junbao Zhuo; Huimin Ma; Jiansheng Chen

arXiv:2603.24484·cs.CV·March 26, 2026

Open Access

TL;DR

This paper introduces VisionToM, a framework that enhances multimodal large language models' ability to infer human mental states from visual data, improving reasoning, interpretability, and alignment in real-world scenarios.

Contribution

We propose VisionToM, a novel intervention method that aligns visual representations with semantic targets, boosting ToM reasoning and interpretability in multimodal models.

Findings

01 Significant improvement in ToM accuracy on EgoToM benchmark

02 Enhanced interpretability through attention analysis

03 Better free-form explanation quality in open-ended tasks

Abstract

As large language models (LLMs) continue to advance, there is increasing interest in their ability to infer human mental states and demonstrate a human-like Theory of Mind (ToM). Most existing ToM evaluations, however, are centered on text-based inputs, while scenarios relying solely on visual information receive far less attention. This leaves a gap, since real-world human-AI interaction typically requires multimodal understanding. In addition, many current methods regard the model as a black box and rarely probe how its internal attention behaves in multiple-choice question answering (QA). The impact of LLM hallucinations on such tasks is also underexplored from an interpretability perspective. To address these issues, we introduce VisionToM, a vision-oriented intervention framework designed to strengthen task-aware reasoning. The core idea is to compute intervention vectors that…

Taxonomy

Topics: Multimodal Machine Learning Applications · Explainable Artificial Intelligence (XAI) · Topic Modeling