OmniSpatial: Towards Comprehensive Spatial Reasoning Benchmark for Vision Language Models
Mengdi Jia, Zekun Qi, Shaochen Zhang, Wenyao Zhang, Xinqiang Yu, Jiawei He, He Wang, Li Yi

TL;DR
OmniSpatial is a new, comprehensive benchmark designed to evaluate and challenge vision-language models' spatial reasoning abilities across multiple complex categories, revealing current limitations and proposing strategies for improvement.
Contribution
The paper introduces OmniSpatial, a benchmark of over 8.4K manually annotated question-answer pairs spanning four major spatial reasoning categories and 50 fine-grained subcategories, and explores methods to enhance VLMs' spatial understanding.
Findings
VLMs show significant limitations in comprehensive spatial reasoning.
Explicit scene graph cues improve spatial reasoning performance.
Novel view chain-of-thought enhances reasoning capabilities.
Abstract
Spatial reasoning is a key aspect of cognitive psychology and remains a bottleneck for current vision-language models (VLMs). While extensive research has aimed to evaluate or improve VLMs' understanding of basic spatial relations, such as distinguishing left from right, near from far, and object counting, these tasks cover only the most elementary layer of spatial reasoning and are largely approaching saturation in the latest reasoning models. In this work, we introduce OmniSpatial, a comprehensive and challenging benchmark for spatial reasoning, grounded in cognitive psychology. OmniSpatial covers four major categories: dynamic reasoning, complex spatial logic, spatial interaction, and perspective-taking, with 50 fine-grained subcategories. Through careful manual annotation, we construct over 8.4K question-answer pairs. Extensive experiments show that both open- and closed-source VLMs…
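A benchmark organized into four major categories is usually reported as one accuracy per category plus an overall figure. The sketch below shows one way such a breakdown could be computed. It is illustrative only: the record fields `category`, `answer`, and `prediction`, the exact-match scoring of option letters, and the toy records are all assumptions, not the paper's released evaluation code or data.

```python
from collections import defaultdict

# The four major categories named in the abstract.
CATEGORIES = (
    "dynamic reasoning",
    "complex spatial logic",
    "spatial interaction",
    "perspective-taking",
)


def accuracy_by_category(records):
    """Return (per-category accuracy, overall accuracy).

    `records` is an iterable of dicts with the hypothetical keys
    'category', 'answer', and 'prediction'. Scoring is a simple
    case-insensitive exact match, which is an assumption.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for record in records:
        category = record["category"]
        total[category] += 1
        predicted = record["prediction"].strip().upper()
        expected = record["answer"].strip().upper()
        if predicted == expected:
            correct[category] += 1
    per_category = {c: correct[c] / total[c] for c in total}
    overall = sum(correct.values()) / sum(total.values()) if total else 0.0
    return per_category, overall


# Toy records, invented purely to exercise the function.
toy_records = [
    {"category": "dynamic reasoning", "answer": "A", "prediction": "A"},
    {"category": "perspective-taking", "answer": "C", "prediction": "B"},
    {"category": "perspective-taking", "answer": "D", "prediction": "D"},
    {"category": "spatial interaction", "answer": "B", "prediction": " b"},
]

per_category, overall = accuracy_by_category(toy_records)
for name in CATEGORIES:
    if name in per_category:
        print(f"{name}: {per_category[name]:.2f}")
print(f"overall: {overall:.2f}")
```

Reporting per-category scores alongside the overall number keeps a weak category, such as perspective-taking in the toy data, from being hidden by stronger ones.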
Peer Reviews
Decision·ICLR 2026 Poster
1. Focused and Systematic Scope: The paper maintains a clear focus on spatial reasoning, defining it precisely, covering its cognitive dimensions, and avoiding unnecessary general multimodal extensions. This conceptual focus makes OmniSpatial a coherent and practically usable benchmark.
2. Rigorous Manual Curation: The dataset is human-annotated, multi-sourced, and cross-validated with strong inter-annotator agreement, addressing common weaknesses of synthetic or template-based datasets.
1. Lack of Deep Analysis or Failure Studies: The paper could benefit from qualitative examples showing why models fail (e.g., depth reasoning errors, frame-of-reference confusion, or temporal misalignment).
2. Marginal Quantitative Gains: The improvements from PointGraph and SpatialCoT are modest (≈1–2 points per dimension), raising questions about their practical impact.
- The paper is well written and easy to understand.
- The dataset construction is solid and carefully annotated by humans.
- The evaluation is comprehensive.
**Training Data Leakage Concern** - While the dataset is manually curated, some sources (e.g., web images, exam-style questions) may overlap with model pretraining corpora. A clearer discussion on leakage mitigation, measurement, and dataset decontamination would strengthen the benchmark’s credibility.
**Compute Cost of SpatialCoT** - The proposed SpatialCoT relies on multi-view synthesis, which appears computationally expensive. A discussion of its runtime, resource requirements, and potential
- The proposed benchmark introduces a new and challenging evaluation setting that explores aspects of spatial reasoning rarely addressed in previous datasets. It is notably more complex and comprehensive than prior benchmarks.
- The question annotations involve a human-in-the-loop process to ensure clarity, answer uniqueness, and the resolution of ambiguous spatial references.
- The evaluation includes a wide range of VLMs—covering reasoning-focused, open-source, closed-source, and human baselin
- There is no qualitative analysis of failure cases. Investigating these failures would strengthen the paper further. Providing a few examples and categorizing the errors could help reveal which aspects of reasoning need improvement, such as perception, logical reasoning, or consistency.
- The paper only demonstrates the effectiveness of SpatialCoT on the perspective-taking task. How does this approach affect performance on other task types? This might raise some concern that it makes model perfor
