From Indoor to Open World: Revealing the Spatial Reasoning Gap in MLLMs

Mingrui Wu; Zhaozhi Wang; Fangjinhua Wang; Jiaolong Yang; Marc Pollefeys; Tong Zhang

arXiv:2512.19683 · cs.CV · December 30, 2025

TL;DR

This paper introduces a large-scale outdoor spatial reasoning benchmark for Multimodal Large Language Models (MLLMs). Evaluations on it indicate that the models rely on linguistic priors rather than grounded visual reasoning, exposing a gap in their spatial intelligence.

Contribution

The paper presents a novel outdoor dataset with metric ground truth for spatial reasoning, enabling comprehensive evaluation of MLLMs' spatial understanding in open-world scenarios.

Findings

01. MLLMs perform poorly in open-world spatial reasoning tasks.

02. Current models rely heavily on linguistic priors rather than visual grounding.

03. Structured indoor benchmarks do not reflect outdoor spatial reasoning capabilities.

Abstract

While Multimodal Large Language Models (MLLMs) have achieved impressive performance on semantic tasks, their spatial intelligence--crucial for robust and grounded AI systems--remains underdeveloped. Existing benchmarks fall short of diagnosing this limitation: they either focus on overly simplified qualitative reasoning or rely on domain-specific indoor data, constrained by the lack of outdoor datasets with verifiable metric ground truth. To bridge this gap, we introduce a large-scale benchmark built from pedestrian-perspective videos captured with synchronized stereo cameras, LiDAR, and IMU/GPS sensors. This dataset provides metrically precise 3D information, enabling the automatic generation of spatial reasoning questions that span a hierarchical spectrum--from qualitative relational reasoning to quantitative metric and kinematic understanding. Evaluations reveal that the performance…
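To make the idea of generating questions automatically from metric 3D ground truth more concrete, here is a minimal sketch in Python. It is not the authors' pipeline: the `Object3D` record, the question templates, and the function names are all hypothetical, and it assumes object positions are already available in meters in a shared coordinate frame. It illustrates the two ends of the spectrum the abstract describes, a qualitative relational question and a quantitative metric one.

```python
import math
from dataclasses import dataclass


@dataclass
class Object3D:
    # Hypothetical record: a label plus a metric position in meters.
    label: str
    x: float
    y: float
    z: float


def distance(a: Object3D, b: Object3D) -> float:
    """Euclidean distance in meters between two annotated objects."""
    return math.dist((a.x, a.y, a.z), (b.x, b.y, b.z))


def qualitative_question(camera: Object3D, a: Object3D, b: Object3D):
    """Relational question: which object is closer to the camera?"""
    closer = a if distance(camera, a) < distance(camera, b) else b
    text = f"Which is closer to the camera, the {a.label} or the {b.label}?"
    return text, closer.label


def metric_question(a: Object3D, b: Object3D):
    """Quantitative question: how far apart are two objects, in meters?"""
    text = f"How far apart are the {a.label} and the {b.label}, in meters?"
    return text, round(distance(a, b), 2)


# Illustrative values only; these are not taken from the dataset.
camera = Object3D("camera", 0.0, 0.0, 0.0)
bench = Object3D("bench", 3.0, 0.0, 4.0)
tree = Object3D("tree", 6.0, 0.0, 8.0)

q1, a1 = qualitative_question(camera, bench, tree)
q2, a2 = metric_question(bench, tree)
```

Because each answer is computed directly from the 3D positions, it can be checked against the ground truth without manual labeling, which is the property that makes question generation automatic in a setup like this.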

Taxonomy

Topics: Multimodal Machine Learning Applications · Spatial Cognition and Navigation · Constraint Satisfaction and Optimization