Overcoming Multi-step Complexity in Multimodal Theory-of-Mind Reasoning: A Scalable Bayesian Planner

Chunhui Zhang, Zhongyu Ouyang, Kwonjoon Lee, Nakul Agarwal, Sean Dae Houlihan, Soroush Vosoughi, and Shao-Yuan Lo

arXiv:2506.01301·cs.AI·May 12, 2026

TL;DR

This paper introduces a scalable Bayesian planner for Theory-of-Mind reasoning that leverages small and large language models to improve accuracy in complex multimodal social cognition tasks.

Contribution

It proposes a novel stepwise Bayesian approach with weak-to-strong control, enabling scalable and generalizable ToM reasoning across different model sizes.

Findings

01. Achieves a 4.6% accuracy improvement over state-of-the-art methods.

02. Generalizes effectively to unseen multimodal ToM scenarios.

03. Demonstrates the benefit of transferring reasoning behaviors from small to large models.

Abstract

Theory-of-Mind (ToM) enables humans to infer mental states, such as beliefs, desires, and intentions, forming the foundation of social cognition. However, existing computational ToM methods rely on structured workflows with ToM-specific priors or deep model fine-tuning, which struggle with scalability in multimodal environments and fail to generalize as task complexity increases. To address these limitations, we propose a scalable Bayesian ToM planner that decomposes ToM reasoning into stepwise Bayesian updates. Our framework introduces weak-to-strong control, allowing smaller language models (LMs) to specialize in ToM-specific likelihood estimation and transfer their reasoning behaviors to larger LMs (7B to 405B) for integration with social and world knowledge. This synergistic approach aligns large-model inference of human mental states with Bayesian principles. Extensive experiments…
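As a rough illustration of what "stepwise Bayesian updates" could look like, the sketch below keeps a belief over candidate mental-state hypotheses and renormalizes it after each observed step. It is not the paper's implementation: the function names, the hypothesis labels, and the likelihood values are all hypothetical, and the per-step likelihoods are hard-coded here, whereas the abstract describes them as estimated by language models.

```python
def normalize(weights):
    """Rescale non-negative weights so they sum to 1."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    return {h: w / total for h, w in weights.items()}


def bayes_step(prior, likelihood):
    """One Bayesian update: posterior(h) is proportional to prior(h) * P(obs | h)."""
    return normalize({h: prior[h] * likelihood[h] for h in prior})


def stepwise_posterior(prior, likelihoods_per_step):
    """Apply one update per observed step and return the belief after each step."""
    belief = normalize(prior)
    trajectory = [belief]
    for likelihood in likelihoods_per_step:
        belief = bayes_step(belief, likelihood)
        trajectory.append(belief)
    return trajectory


# Hypothetical example: two candidate goals and two observed actions.
# The likelihood values are made up; in the setting the abstract describes,
# they would come from a model scoring each action under each hypothesis.
prior = {"wants_apple": 0.5, "wants_cup": 0.5}
steps = [
    {"wants_apple": 0.8, "wants_cup": 0.2},  # P(action 1 | hypothesis)
    {"wants_apple": 0.6, "wants_cup": 0.4},  # P(action 2 | hypothesis)
]
trajectory = stepwise_posterior(prior, steps)
final = trajectory[-1]
```

Updating one step at a time keeps each inference local: the posterior from one step becomes the prior for the next, so the cost grows with the number of steps rather than requiring a single joint inference over the whole trajectory.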

Videos

Overcoming Multi-step Complexity in Multimodal Theory-of-Mind Reasoning: A Scalable Bayesian Planner · SlidesLive