SpatialBench: Benchmarking Multimodal Large Language Models for Spatial Cognition
Peiran Xu, Sudong Wang, Yao Zhu, Jianing Li, Gege Qi, Yunjian Zhang

TL;DR
This paper introduces SpatialBench, a comprehensive benchmark and hierarchical framework for evaluating the spatial cognition abilities of multimodal large language models (MLLMs) across five levels of complexity.
Contribution
The paper proposes a hierarchical spatial cognition framework, constructs a fine-grained benchmark of 15 tasks aligned with the framework's five levels, and introduces a unified, capability-oriented metric for assessing spatial cognition in MLLMs.
Findings
Models excel in perceptual grounding but struggle with symbolic reasoning.
Performance varies significantly across different cognitive levels.
Humans outperform models in goal-directed spatial abstraction.
Abstract
Spatial cognition is fundamental to real-world multimodal intelligence, allowing models to effectively interact with the physical environment. While multimodal large language models (MLLMs) have made significant strides, existing benchmarks often oversimplify spatial cognition, reducing it to a single-dimensional metric, which fails to capture the hierarchical structure and interdependence of spatial abilities. To address this gap, we propose a hierarchical spatial cognition framework that decomposes spatial intelligence into five progressively complex levels from basic observation to high-level planning. Building upon this taxonomy, we construct SpatialBench, a large-scale, fine-grained benchmark covering 15 tasks aligned with these cognitive levels. To provide a unified evaluation across heterogeneous tasks, we further introduce a high-level capability-oriented metric that reliably…
