SPHERE: Unveiling Spatial Blind Spots in Vision-Language Models Through Hierarchical Evaluation

Wenyu Zhang; Wei En Ng; Lixin Ma; Yuwen Wang; Junqi Zhao; Allison Koenecke; Boyang Li; Lu Wang

arXiv:2412.12693 · cs.CV · June 10, 2025

TL;DR

SPHERE introduces a hierarchical evaluation framework and dataset to identify and analyze spatial reasoning blind spots in vision-language models, revealing significant deficiencies in complex spatial understanding.

Contribution

The paper presents SPHERE, a novel hierarchical evaluation framework and dataset for assessing spatial reasoning in vision-language models, highlighting their current limitations.

Findings

1. Models struggle to reason about distance and proximity.
2. Models show significant gaps in understanding both egocentric and allocentric perspectives.
3. Current models fall short in applying spatial logic in physical contexts.

Abstract

Current vision-language models may grasp basic spatial cues and simple directions (e.g. left, right, front, back), but struggle with the multi-dimensional spatial reasoning necessary for human-like understanding and real-world applications. To address this gap, we develop SPHERE (Spatial Perception and Hierarchical Evaluation of REasoning), a hierarchical evaluation framework supported by a new human-annotated dataset. SPHERE systematically probes models across increasing levels of complexity, from fundamental skills to multi-skill integration and high-level reasoning that combines spatial, visual, and logical understanding. Benchmark evaluation of state-of-the-art models reveals significant deficiencies, especially in reasoning about distance and proximity, understanding both egocentric and allocentric perspectives, and applying spatial logic in physical contexts. These findings expose…

Code & Models

Repositories

zwenyu/SPHERE-VLM
PyTorch · Official

Datasets

wei2912/SPHERE-VLM
dataset · 225 downloads

Videos

SPHERE: Unveiling Spatial Blind Spots in Vision-Language Models Through Hierarchical Evaluation

Taxonomy

Topics: Geographic Information Systems Studies