SpatiaLab: Can Vision-Language Models Perform Spatial Reasoning in the Wild?

Azmine Toushik Wasi; Wahid Faisal; Abdur Rahman; Mahfuz Ahmed Anik; Munem Shahriar; Mohsin Mahmud Topu; Sadia Tasnim Meem; Rahatun Nesa Priti; Sabrina Afroz Mitu; Md. Iqramul Hoque; Shahriyar Zaman Ridoy; Mohammed Eunus Ali; Majd Hawasly; Mohammad Raza; Md Rizwan Parvez

arXiv:2602.03916·cs.CV·May 12, 2026

TL;DR

SpatiaLab introduces a comprehensive benchmark to evaluate vision-language models' spatial reasoning in real-world scenarios, revealing significant gaps compared to human performance and highlighting key challenges for future development.

Contribution

We present SpatiaLab, a large-scale, diverse benchmark for assessing VLMs' spatial reasoning in realistic contexts, addressing limitations of prior synthetic and puzzle-like evaluations.

Findings

01

State-of-the-art VLMs perform significantly worse than humans on spatial reasoning tasks.

02

Models show a 10-25% performance drop on open-ended questions compared with multiple-choice questions.

03

SpatiaLab exposes critical limitations in handling complex spatial relationships and 3D geometry.

Abstract

Spatial reasoning is a fundamental aspect of human cognition, yet it remains a major challenge for contemporary vision-language models (VLMs). Prior work largely relied on synthetic or LLM-generated environments with limited task designs and puzzle-like setups, failing to capture the real-world complexity, visual noise, and diverse spatial relationships that VLMs encounter. To address this, we introduce SpatiaLab, a comprehensive benchmark for evaluating VLMs' spatial reasoning in realistic, unconstrained contexts. SpatiaLab comprises 1,400 visual question-answer pairs across six major categories: Relative Positioning, Depth & Occlusion, Orientation, Size & Scale, Spatial Navigation, and 3D Geometry, each with five subcategories, yielding 30 distinct task types. Each subcategory contains at least 25 questions, and each main category includes at least 200 questions, supporting both…
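The structure stated in the abstract (six categories, five subcategories each, at least 25 questions per subcategory and at least 200 per category) can be checked mechanically. The sketch below is illustrative only: the category names come from the abstract, but the record layout of (category, subcategory) pairs, the function name check_coverage, and the placeholder subcategory labels are assumptions, not the dataset's actual schema.

```python
from collections import Counter

# Category names are taken from the abstract.
CATEGORIES = [
    "Relative Positioning", "Depth & Occlusion", "Orientation",
    "Size & Scale", "Spatial Navigation", "3D Geometry",
]
SUBCATEGORIES_PER_CATEGORY = 5   # 6 * 5 = 30 task types
MIN_PER_SUBCATEGORY = 25
MIN_PER_CATEGORY = 200


def check_coverage(records):
    """Return a list of violations of the minimums stated in the abstract.

    `records` holds one (category, subcategory) pair per question.
    This layout is hypothetical, not the dataset's real schema.
    """
    records = list(records)
    per_category = Counter(cat for cat, _ in records)
    per_subcategory = Counter(records)
    problems = []
    for cat in CATEGORIES:
        if per_category[cat] < MIN_PER_CATEGORY:
            problems.append(f"{cat}: {per_category[cat]} questions")
        n_subs = sum(1 for (c, _) in per_subcategory if c == cat)
        if n_subs != SUBCATEGORIES_PER_CATEGORY:
            problems.append(f"{cat}: {n_subs} subcategories")
    for (cat, sub), n in per_subcategory.items():
        if n < MIN_PER_SUBCATEGORY:
            problems.append(f"{cat} / {sub}: {n} questions")
    return problems


def make_records(per_subcategory):
    """Build synthetic records with placeholder subcategory labels."""
    return [
        (cat, f"sub{i}")
        for cat in CATEGORIES
        for i in range(SUBCATEGORIES_PER_CATEGORY)
        for _ in range(per_subcategory)
    ]


if __name__ == "__main__":
    print(check_coverage(make_records(40)))  # 200 per category
```

Note that the two minimums interact: 25 questions in each of five subcategories gives only 125 per category, so meeting the 200-per-category floor requires an average of at least 40 per subcategory.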

Code & Models

Datasets

ciol-research/SpatiaLab
dataset · 129 downloads

Videos

SpatiaLab: Can Vision–Language Models Perform Spatial Reasoning in the Wild?· slideslive