Theory of Space: Can Foundation Models Construct Spatial Beliefs through Active Exploration?
Pingyue Zhang, Zihan Huang, Yue Wang, Jieyu Zhang, Letian Xue, Zihan Wang, Qineng Wang, Keshigeyan Chandrasegaran, Ruohan Zhang, Yejin Choi, Ranjay Krishna, Jiajun Wu, Li Fei-Fei, and Manling Li

TL;DR
This paper introduces the 'Theory of Space' framework to evaluate how well foundation models can actively explore and build spatial beliefs, revealing key limitations in current models' ability to maintain and update spatial knowledge during autonomous exploration.
Contribution
It proposes a new benchmark and spatial belief probing method to diagnose the active exploration capabilities and limitations of foundation models in spatial tasks.
Findings
Significant performance drop in autonomous exploration scenarios.
Models explore unsystematically compared to program-based proxies.
Global beliefs in models are unstable and degrade over time.
Abstract
Spatial embodied intelligence requires agents to act to acquire information under partial observability. While multimodal foundation models excel at passive perception, their capacity for active, self-directed exploration remains understudied. We propose Theory of Space, defined as an agent's ability to actively acquire information through self-directed, active exploration and to construct, revise, and exploit a spatial belief from sequential, partial observations. We evaluate this through a benchmark where the goal is curiosity-driven exploration to build an accurate cognitive map. A key innovation is spatial belief probing, which prompts models to reveal their internal spatial representations at each step. Our evaluation of state-of-the-art models reveals several critical bottlenecks. First, we identify an Active-Passive Gap, where performance drops significantly when agents must…
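As a rough illustration of what probing a spatial belief at each step could look like, the sketch below keeps an explicit map of object positions, revises it as partial observations arrive, and scores it against ground truth after every step. Everything here is an assumption made for illustration: the `SpatialBelief` class, the `probe` function, the coverage and accuracy scores, and the tolerance are hypothetical and are not the paper's actual protocol or metrics. In the paper, the belief is elicited from the model by prompting rather than read from a data structure.

```python
from dataclasses import dataclass, field


@dataclass
class SpatialBelief:
    """Hypothetical belief store: object name -> believed (x, y) position."""
    positions: dict = field(default_factory=dict)

    def update(self, observation):
        # A newer observation of an object overwrites the older estimate.
        self.positions.update(observation)


def probe(belief, ground_truth, tol=0.5):
    """Return (coverage, accuracy) of the belief against the true layout.

    coverage: fraction of true objects that appear in the belief.
    accuracy: fraction of believed objects placed within `tol` of the truth
              on both axes (an illustrative score, not the paper's metric).
    """
    known = [name for name in ground_truth if name in belief.positions]
    coverage = len(known) / len(ground_truth)
    correct = 0
    for name in known:
        bx, by = belief.positions[name]
        gx, gy = ground_truth[name]
        if abs(bx - gx) <= tol and abs(by - gy) <= tol:
            correct += 1
    accuracy = correct / len(known) if known else 0.0
    return coverage, accuracy


# Toy episode: three partial observations, the second one is wrong about the lamp.
truth = {"chair": (1, 1), "lamp": (4, 2), "table": (5, 5)}
observations = [
    {"chair": (1, 1)},
    {"lamp": (4, 4)},
    {"lamp": (4, 2), "table": (5, 5)},
]

belief = SpatialBelief()
history = []
for obs in observations:
    belief.update(obs)
    history.append(probe(belief, truth))  # probe after every step
```

Recording the probe result at every step, rather than only at the end, is what makes it possible to see whether a belief improves, stalls, or degrades during exploration.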
Peer Reviews
Decision: ICLR 2026 Poster
- Concepts and definition contribution: ToS captures the ability to (1) construct a globally consistent belief from partial views, (2) update it as new evidence conflicts, (3) utilize it for downstream spatial tasks. The two-phase evaluation separates Exploration to gather observations and build a belief from Reasoning on route/survey/update tasks to test utilization. The task formulation itself is interesting, although this paradigm has been explored in early embodied AI benchmarks like EXCALIB
- The current environment relies on symbolic discretization: angles and distances are bucketed into categorical bins (e.g., {near, mid, far}) and rendered with calibration cues (reference grids, constant lighting). This yields clean supervision but strips away realistic sensory ambiguity such as partial occlusions, depth uncertainty, and texture variation. Models may appear spatially consistent only because the environment removes real-world ambiguities. This inflates performance and may not tra
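The symbolic discretization this review describes can be sketched as follows: a continuous distance and bearing are mapped to categorical bins. This is a minimal sketch under stated assumptions. The bin edges, the eight direction labels, and the function names are hypothetical; only the {near, mid, far} label set comes from the review text.

```python
import math

# Hypothetical bin edges (upper bound, label); the benchmark's real thresholds are not given here.
DISTANCE_BINS = [(2.0, "near"), (4.0, "mid"), (math.inf, "far")]

# Hypothetical 8-way direction labels, 45 degrees per sector, clockwise from straight ahead.
DIRECTIONS = ["front", "front-right", "right", "back-right",
              "back", "back-left", "left", "front-left"]


def bin_distance(distance):
    """Map a continuous distance to the first bin whose upper bound covers it."""
    for upper, label in DISTANCE_BINS:
        if distance <= upper:
            return label


def bin_angle(relative_deg):
    """Map a relative bearing in degrees (0 = straight ahead, clockwise positive) to a label."""
    sector = int(((relative_deg % 360) + 22.5) // 45) % 8
    return DIRECTIONS[sector]


def describe(agent_xy, heading_deg, object_xy):
    """Return (direction_label, distance_label) for an object as seen by the agent.

    heading_deg is measured clockwise from the +y axis.
    """
    dx = object_xy[0] - agent_xy[0]
    dy = object_xy[1] - agent_xy[1]
    distance = math.hypot(dx, dy)
    bearing = math.degrees(math.atan2(dx, dy))  # 0 = +y axis, clockwise positive
    return bin_angle(bearing - heading_deg), bin_distance(distance)


# Two clearly different positions collapse to the same symbolic observation.
obs_a = describe((0, 0), 0, (0, 0.5))
obs_b = describe((0, 0), 0, (0, 1.9))
```

Here `obs_a` and `obs_b` are both `("front", "near")`, which shows the information that binning discards.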
1. The paper designs spatial belief evaluation around task-agnostic, uncertainty-reducing exploration, rather than passive reasoning and goal-directed task completion, adding an active aspect to spatial belief construction. This makes the paper original. 2. Spatial understanding under partial observability is central for embodied agents and planning. The ToS framework fills in the gap for LLM evaluation in enactive cognition by evaluating the goal-agnostic exploration of active LLM agents. 3.
1. Relatively simple spatial environment and limited statistical results: most experiments use two connected $6$ by $6$ rooms with $9$ objects, and some with varying room size. While the small size of the spatial environment avoids memory capacity as a confounding factor for ToS performance, its simplicity could potentially trivialize the active exploration aspect of the agent. The generalization beyond the grid room is therefore unclear. As a consequence, there are also no statistical results o
Theoretical soundness: The paper is well written and well grounded in cognitive theory. Technical soundness: A stack of tools and technologies is integrated effectively to design the ToS benchmark. A large number of models is evaluated.
This work aims to produce a benchmark for human-like spatial understanding and exploration. While this motivation, and the ToS approach as a means of achieving it, are theoretically well justified, the paper offers no evidence of how a human would behave in these tasks. The models are compared to two proxy agents, which the authors claim execute a theoretically optimal path, to establish an upper bound on exploration ability. However, no proof of the proxy agents' optimality is given. Both weaknesses
Taxonomy
Topics: Multimodal Machine Learning Applications · Spatial Cognition and Navigation · Embodied and Extended Cognition
