Beyond the Cartesian Illusion: Testing Two-Stage Multi-Modal Theory of Mind under Perceptual Bottlenecks
Yajing Zhou, Xiangyu Kong

TL;DR
This paper investigates the limitations of Multi-Modal Large Language Models in spatial reasoning within multi-agent environments, proposing a novel module and reasoning chain to improve their understanding of second-order Theory of Mind under perceptual constraints.
Contribution
It introduces an Epistemic Sensory Bottleneck module and Anchor-Based Spatial Chain-of-Thought to enhance MLLMs' spatial inference and Theory of Mind capabilities in embodied AI scenarios.
Findings
Current MLLMs achieve 42% accuracy in spatial symmetry tasks.
The proposed reasoning chain outperforms egocentric and allocentric baselines.
Benchmarking reveals fundamental limits in current spatial reasoning abilities.
Abstract
While Multi-Modal Large Language Models (MLLMs) demonstrate impressive capabilities in general reasoning, their embodied spatial intelligence remains hampered by a "Cartesian Illusion": a reliance on text-based probability distributions that lack grounded, 3D topological understanding. This limitation is starkly exposed in multi-agent environments, which demand more than scene perception; they require second-order Theory of Mind (ToM). Specifically, Agent A must be able to infer Agent B's belief about the environment, which is governed strictly by Agent B's physical orientation and sensory limitations. In this paper, we probe the limits of two-stage spatial inference in MLLMs through a novel audio-visual task that requires Agent A to predict Agent B's estimate of A's relative location. To solve this, we propose an Epistemic Sensory Bottleneck module that abandons rigid, rule-based…
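The geometry behind the task described in the abstract can be illustrated with a small sketch. This is not the paper's Epistemic Sensory Bottleneck module, whose details are cut off in the excerpt; it is only a hypothetical geometric reference showing what "B's estimate of A's relative location" could mean when B's perception depends on B's orientation. The function names, the 2D setup, the 90-degree field of view, and the coarse left/right audio fallback are all assumptions made for illustration.

```python
import math

def relative_bearing(observer_xy, observer_heading_deg, target_xy):
    """Bearing of the target in the observer's egocentric frame.

    Returns degrees in [-180, 180): 0 is straight ahead,
    positive is to the observer's left (counter-clockwise).
    """
    dx = target_xy[0] - observer_xy[0]
    dy = target_xy[1] - observer_xy[1]
    world_angle = math.degrees(math.atan2(dy, dx))
    # Shift into the observer's frame and wrap to [-180, 180).
    return (world_angle - observer_heading_deg + 180.0) % 360.0 - 180.0

def b_estimate_of_a(a_xy, b_xy, b_heading_deg, fov_deg=90.0):
    """Hypothetical model of what B can know about A's location.

    Assumption: inside B's field of view, B sees A and gets a precise
    bearing; outside it, B only hears A and gets a coarse side label.
    """
    bearing = relative_bearing(b_xy, b_heading_deg, a_xy)
    if abs(bearing) <= fov_deg / 2:
        return ("visual", bearing)
    return ("audio", "left" if bearing > 0 else "right")

# Agent A reasons about B's belief using B's pose, not A's own view.
facing_a = b_estimate_of_a((0.0, 0.0), (1.0, 0.0), b_heading_deg=180.0)
facing_north = b_estimate_of_a((0.0, 0.0), (1.0, 0.0), b_heading_deg=90.0)
```

In this sketch the same pair of positions yields different beliefs for B depending only on B's heading, which is the dependency the task asks Agent A to reason about.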
