Video-Only ToM: Enhancing Theory of Mind in Multimodal Large Language Models
Siqi Liu, Xinyang Li, Bochao Zou, Junbao Zhuo, Huimin Ma, Jiansheng Chen

TL;DR
This paper introduces VisionToM, a framework that enhances multimodal large language models' ability to infer human mental states from visual data, improving reasoning, interpretability, and alignment in real-world scenarios.
Contribution
We propose VisionToM, a novel intervention method that aligns visual representations with semantic targets, boosting ToM reasoning and interpretability in multimodal models.
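As a rough illustration of what an activation-level intervention can look like in general, the sketch below computes a mean-difference direction between two groups of feature vectors and shifts a new vector along it. This is a generic technique chosen for illustration, not the procedure VisionToM uses, which this summary does not specify. All function names, the toy numbers, and the `alpha` scaling parameter are assumptions.

```python
# Hypothetical illustration of a mean-difference intervention.
# Names, data, and alpha are assumptions, not taken from the paper.

def mean_vector(rows):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def intervention_vector(positive, negative):
    """Direction pointing from the 'negative' mean to the 'positive' mean."""
    mu_pos = mean_vector(positive)
    mu_neg = mean_vector(negative)
    return [p - q for p, q in zip(mu_pos, mu_neg)]

def apply_intervention(activation, direction, alpha=1.0):
    """Shift an activation along the direction, scaled by alpha."""
    return [a + alpha * d for a, d in zip(activation, direction)]

# Toy activations (made-up numbers) for two groups of examples.
correct = [[1.0, 2.0], [3.0, 4.0]]
incorrect = [[0.0, 1.0], [2.0, 1.0]]

direction = intervention_vector(correct, incorrect)
steered = apply_intervention([0.5, 0.5], direction, alpha=0.5)
```

In a real model the vectors would be hidden-layer or attention-head activations, and the choice of layer and scaling strength would need to be tuned empirically.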
Findings
Significant improvement in ToM accuracy on the EgoToM benchmark
Enhanced interpretability through attention analysis
Better free-form explanation quality in open-ended tasks
Abstract
As large language models (LLMs) continue to advance, there is increasing interest in their ability to infer human mental states and demonstrate a human-like Theory of Mind (ToM). Most existing ToM evaluations, however, are centered on text-based inputs, while scenarios relying solely on visual information receive far less attention. This leaves a gap, since real-world human-AI interaction typically requires multimodal understanding. In addition, many current methods regard the model as a black box and rarely probe how its internal attention behaves in multiple-choice question answering (QA). The impact of LLM hallucinations on such tasks is also underexplored from an interpretability perspective. To address these issues, we introduce VisionToM, a vision-oriented intervention framework designed to strengthen task-aware reasoning. The core idea is to compute intervention vectors that…
Taxonomy
Topics: Multimodal Machine Learning Applications · Explainable Artificial Intelligence (XAI) · Topic Modeling
