Beyond the Cartesian Illusion: Testing Two-Stage Multi-Modal Theory of Mind under Perceptual Bottlenecks

Yajing Zhou; Xiangyu Kong

arXiv:2605.18194·cs.AI·May 19, 2026

TL;DR

This paper investigates the limitations of Multi-Modal Large Language Models in spatial reasoning within multi-agent environments, proposing a novel module and reasoning chain to improve their understanding of second-order Theory of Mind under perceptual constraints.

Contribution

It introduces an Epistemic Sensory Bottleneck module and Anchor-Based Spatial Chain-of-Thought to enhance MLLMs' spatial inference and Theory of Mind capabilities in embodied AI scenarios.

Findings

01. Current MLLMs achieve 42% accuracy in spatial symmetry tasks.

02. The proposed reasoning chain outperforms egocentric and allocentric baselines.

03. Benchmarking reveals fundamental limits in current spatial reasoning abilities.

Abstract

While Multi-Modal Large Language Models (MLLMs) demonstrate impressive capabilities in general reasoning, their embodied spatial intelligence remains hampered by a "Cartesian Illusion": a reliance on text-based probability distributions that lack grounded, 3D topological understanding. This limitation is starkly exposed in multi-agent environments, which demand more than just scene perception; they require second-order Theory of Mind (ToM). Specifically, an Agent A must be able to infer Agent B's belief about the environment, governed strictly by Agent B's physical orientation and sensory limitations. In this paper, we probe the limits of two-stage spatial inference in MLLMs through a novel audio-visual task: requiring Agent A to predict Agent B's estimation of A's relative location. To solve this, we propose an Epistemic Sensory Bottleneck module that abandons rigid, rule-based…
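
To make the task geometry concrete, the following is a minimal sketch of the frame change that the task asks a model to reason about: expressing Agent A's position in Agent B's egocentric frame, given B's heading. It is not the paper's method or its Epistemic Sensory Bottleneck module. The function names, the 2D setup, the four coarse direction labels, and the 120-degree field of view are all illustrative assumptions that do not come from the source.

```python
import math

def relative_bearing(observer_xy, observer_heading_deg, target_xy):
    """Bearing of the target in the observer's egocentric frame, in degrees.

    Assumed convention (hypothetical, not from the paper): headings are
    measured counter-clockwise from the +x axis; the result lies in
    [-180, 180), with 0 meaning straight ahead and positive values
    meaning the observer's left.
    """
    dx = target_xy[0] - observer_xy[0]
    dy = target_xy[1] - observer_xy[1]
    world_angle = math.degrees(math.atan2(dy, dx))
    # Rotate into the observer's frame, then wrap into [-180, 180).
    return (world_angle - observer_heading_deg + 180.0) % 360.0 - 180.0

def coarse_direction(bearing_deg):
    """Map a bearing to one of four coarse labels (an assumed label set)."""
    if abs(bearing_deg) <= 45.0:
        return "front"
    if 45.0 < bearing_deg < 135.0:
        return "left"
    if -135.0 < bearing_deg < -45.0:
        return "right"
    return "behind"

def in_field_of_view(bearing_deg, fov_deg=120.0):
    """True if the target falls inside an assumed symmetric visual cone."""
    return abs(bearing_deg) <= fov_deg / 2.0

# Agent A stands at the origin; Agent B stands at (2, 0) facing +y.
agent_a = (0.0, 0.0)
agent_b = (2.0, 0.0)
agent_b_heading = 90.0

bearing = relative_bearing(agent_b, agent_b_heading, agent_a)
print(coarse_direction(bearing), in_field_of_view(bearing))
```

In this configuration A lies on B's left and outside the assumed visual cone, so under these assumptions B could only locate A through a non-visual cue such as sound. The second-order question in the abstract is whether A can predict that estimate from B's orientation and sensory limits, rather than from A's own view of the scene.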
