SpaCE-10: A Comprehensive Benchmark for Multimodal Large Language Models in Compositional Spatial Intelligence
Ziyang Gong, Wenhao Li, Oliver Ma, Songyuan Li, Zhaokai Wang, Jiayi Ji, Xue Yang, Gen Luo, Junchi Yan, Rongrong Ji

TL;DR
SpaCE-10 is a new benchmark designed to evaluate the spatial intelligence of multimodal large language models across atomic and compositional capabilities, revealing significant gaps compared to human performance.
Contribution
We introduce SpaCE-10, a comprehensive benchmark with a hierarchical annotation pipeline and extensive QA pairs to evaluate spatial reasoning in MLLMs.
Findings
Current MLLMs lag behind humans in spatial tasks.
Counting ability is a major limiting factor for MLLMs' spatial reasoning.
SpaCE-10 provides insights to improve MLLM spatial capabilities.
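The capability-level diagnosis behind these findings can be illustrated with a minimal sketch. It assumes each multiple-choice question is tagged with the atomic capabilities it requires and that a question counts toward every capability it is tagged with. The record fields, the capability tags other than counting, and the sample records are hypothetical placeholders, not SpaCE-10 data or the authors' evaluation code.

```python
from collections import defaultdict


def per_capability_accuracy(records):
    """Return {capability: accuracy} over multiple-choice records.

    Each record is a dict with hypothetical fields:
      'capabilities': atomic capability tags the question requires
      'answer':       gold option letter
      'prediction':   option letter chosen by the model
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for rec in records:
        hit = rec["prediction"] == rec["answer"]
        # A compositional question contributes to each of its atomic tags.
        for cap in rec["capabilities"]:
            total[cap] += 1
            correct[cap] += int(hit)
    return {cap: correct[cap] / total[cap] for cap in total}


# Made-up records for illustration only (not benchmark results).
records = [
    {"capabilities": ["counting", "object_recognition"], "answer": "B", "prediction": "A"},
    {"capabilities": ["object_recognition"], "answer": "C", "prediction": "C"},
    {"capabilities": ["counting", "spatial_relation"], "answer": "D", "prediction": "D"},
]
scores = per_capability_accuracy(records)
for cap, acc in sorted(scores.items()):
    print(f"{cap}: {acc:.2f}")
```

Grouping accuracy by capability tag rather than by question type is one way a weakness in a single atomic skill, such as counting, could show up across several compositional question types.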
Abstract
Multimodal Large Language Models (MLLMs) have achieved remarkable progress in various multimodal tasks. To pursue higher intelligence in space, MLLMs must integrate multiple spatial capabilities, even to handle simple, ordinary tasks. However, existing benchmarks struggle to comprehensively evaluate the spatial intelligence of common MLLMs from the atomic level to the compositional level. To fill this gap, we present SpaCE-10, a comprehensive benchmark for compositional spatial evaluations. In SpaCE-10, we define 10 atomic spatial capabilities, which are combined to form 8 compositional capabilities. Based on these definitions, we propose a novel hierarchical annotation pipeline to generate high-quality and diverse question-answer (QA) pairs. With more than 150 hours of human expert effort, we obtain over 5k QA pairs for 811 real indoor scenes in SpaCE-10, which covers various…
Peer Reviews
Decision: ICLR 2026 Poster
- The paper convincingly identifies the lack of a unified benchmark for compositional spatial intelligence, distinguishing SpaCE-10 from prior 2D/3D QA datasets.
- The hierarchical annotation pipeline (combining automated generation and human validation) is well thought out, ensuring both scalability and quality.
- By defining 10 atomic spatial capabilities and mapping them to 8 compositional QA types, the benchmark enables fine-grained capability diagnosis rather than raw accuracy comparison.

- Although human validation is used, the quality and potential bias of GPT-generated QAs could influence benchmark reliability; the paper could analyze this more rigorously.
- SpaCE-10 focuses on indoor 3D environments; its applicability to outdoor or dynamic (temporal) spatial reasoning remains unexplored.
- While overall accuracy drops are discussed, the paper could provide deeper qualitative examples showing failure modes and reasoning errors.

- Significance:
  1. The compositional spatial reasoning ability addressed here is important, challenging, and useful in real-world tasks.
  2. The identified performance gap between MLLMs and humans on SpaCE-10 highlights the immediate and practical value of this benchmark in directing future model development.
- Originality: The paper offers an original and highly structured framework that 1. clearly defines 10 atomic-level and 8 compositional-level spatial capabilities, and 2. provides n

1. This paper covers only indoor scenes, mostly residential ones, although many other scenes are worth investigating, such as indoor industrial scenes (e.g., in a factory) and outdoor scenes. Including them would further expand the coverage and generality of the benchmark.
2. The paper could also provide detailed insights for future research (further discussing the causes of its findings and potential improvements), or indicate example practical tasks for potential applications.

1. The paper clearly explains ten basic spatial skills and combines them into eight question types. This setup makes it easier to see which abilities the models are good or bad at, instead of only looking at overall accuracy.
2. The data collection process is well organized, mixing automated generation with human checking to keep questions accurate and varied. The experiments on about 50 models give useful insights.

1. Overall, the benchmark is limited to indoor scenes, which narrows its scope. Real-world spatial intelligence also involves outdoor and embodied settings, for example, navigation and perception in autonomous driving or robotics.
2. Fixing inputs to 8 images may limit multi-view reasoning. What is the performance when the number of views grows?
3. MCQ-only setup: This misses tasks that need precise outputs (e.g., 3D grounding with (x,y,z) coordinates, path planning). What is the current status
Taxonomy
Topics: Constraint Satisfaction and Optimization · Geographic Information Systems Studies · Semantic Web and Ontologies
