SPHERE: Unveiling Spatial Blind Spots in Vision-Language Models Through Hierarchical Evaluation
Wenyu Zhang, Wei En Ng, Lixin Ma, Yuwen Wang, Junqi Zhao, Allison Koenecke, Boyang Li, Lu Wang

TL;DR
SPHERE introduces a hierarchical evaluation framework and dataset to identify and analyze spatial reasoning blind spots in vision-language models, revealing significant deficiencies in complex spatial understanding.
Contribution
The paper presents SPHERE, a novel hierarchical evaluation framework and dataset for assessing spatial reasoning in vision-language models, highlighting their current limitations.
Findings
Models struggle with distance and proximity reasoning.
Models show significant gaps in understanding both egocentric and allocentric perspectives.
Models struggle to apply spatial logic in physical contexts.
Abstract
Current vision-language models may grasp basic spatial cues and simple directions (e.g. left, right, front, back), but struggle with the multi-dimensional spatial reasoning necessary for human-like understanding and real-world applications. To address this gap, we develop SPHERE (Spatial Perception and Hierarchical Evaluation of REasoning), a hierarchical evaluation framework supported by a new human-annotated dataset. SPHERE systematically probes models across increasing levels of complexity, from fundamental skills to multi-skill integration and high-level reasoning that combines spatial, visual, and logical understanding. Benchmark evaluation of state-of-the-art models reveals significant deficiencies, especially in reasoning about distance and proximity, understanding both egocentric and allocentric perspectives, and applying spatial logic in physical contexts. These findings expose…
Taxonomy
Topics: Geographic Information Systems Studies
