From Indoor to Open World: Revealing the Spatial Reasoning Gap in MLLMs
Mingrui Wu, Zhaozhi Wang, Fangjinhua Wang, Jiaolong Yang, Marc Pollefeys, Tong Zhang

TL;DR
This paper introduces a large-scale outdoor spatial reasoning benchmark for Multimodal Large Language Models (MLLMs). Evaluations on it show that the models rely on linguistic priors rather than grounded visual reasoning, exposing a gap in their spatial intelligence.
Contribution
The paper presents a novel outdoor dataset with metric ground truth for spatial reasoning, enabling comprehensive evaluation of MLLMs' spatial understanding in open-world scenarios.
Findings
MLLMs perform poorly in open-world spatial reasoning tasks.
Current models rely heavily on linguistic priors rather than visual grounding.
Structured indoor benchmarks do not reflect outdoor spatial reasoning capabilities.
Abstract
While Multimodal Large Language Models (MLLMs) have achieved impressive performance on semantic tasks, their spatial intelligence--crucial for robust and grounded AI systems--remains underdeveloped. Existing benchmarks fall short of diagnosing this limitation: they either focus on overly simplified qualitative reasoning or rely on domain-specific indoor data, constrained by the lack of outdoor datasets with verifiable metric ground truth. To bridge this gap, we introduce a large-scale benchmark built from pedestrian-perspective videos captured with synchronized stereo cameras, LiDAR, and IMU/GPS sensors. This dataset provides metrically precise 3D information, enabling the automatic generation of spatial reasoning questions that span a hierarchical spectrum--from qualitative relational reasoning to quantitative metric and kinematic understanding. Evaluations reveal that the performance…
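To make the idea of generating questions automatically from metric 3D ground truth concrete, here is a minimal sketch. It is not the authors' pipeline: the `Object3D` record, the camera-frame coordinate convention, the question templates, and the function names are all assumptions made for illustration. It covers only the two ends of the hierarchy named in the abstract that need a single frame (a qualitative relation and a metric distance); kinematic questions would need positions over time and are left out.

```python
import math
from dataclasses import dataclass


@dataclass
class Object3D:
    """Hypothetical annotation: a label plus a position in meters.

    Assumed camera frame: x to the right, y down, z forward.
    """
    label: str
    x: float
    y: float
    z: float


def range_from_camera(obj: Object3D) -> float:
    """Euclidean distance from the camera origin to the object."""
    return math.hypot(obj.x, obj.y, obj.z)


def distance_between(a: Object3D, b: Object3D) -> float:
    """Euclidean distance between two objects."""
    return math.dist((a.x, a.y, a.z), (b.x, b.y, b.z))


def qualitative_question(a: Object3D, b: Object3D) -> tuple[str, str]:
    """Relational question whose answer follows from the 3D positions."""
    question = f"Which is closer to the camera, the {a.label} or the {b.label}?"
    answer = a.label if range_from_camera(a) < range_from_camera(b) else b.label
    return question, answer


def metric_question(a: Object3D, b: Object3D) -> tuple[str, float]:
    """Quantitative question with a ground-truth answer in meters."""
    question = (
        f"What is the distance in meters between the {a.label} "
        f"and the {b.label}?"
    )
    return question, round(distance_between(a, b), 2)


if __name__ == "__main__":
    # Made-up positions, used only to exercise the functions.
    bench = Object3D("bench", 3.0, 0.0, 4.0)
    tree = Object3D("tree", 0.0, 0.0, 10.0)
    print(qualitative_question(bench, tree))
    print(metric_question(bench, tree))
```

Because both answers are computed from the same coordinates, they can be checked without human labeling, which is the property that metric ground truth makes possible.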
Taxonomy
Topics: Multimodal Machine Learning Applications · Spatial Cognition and Navigation · Constraint Satisfaction and Optimization
