TopViewRS: Vision-Language Models as Top-View Spatial Reasoners
Chengzu Li, Caiqi Zhang, Han Zhou, Nigel Collier, Anna Korhonen, Ivan Vulić

TL;DR
This paper evaluates the spatial reasoning abilities of vision-language models from a top-view perspective, revealing significant gaps compared to human performance and highlighting the need for improved reasoning capabilities.
Contribution
Introduces the TopViewRS dataset and systematically assesses VLMs' top-view spatial reasoning, exposing their limitations and guiding future research.
Findings
VLMs perform over 50% worse than humans on spatial reasoning tasks.
Chain-of-Thought reasoning improves VLM performance by 5.82% on average.
In some cases, VLMs perform worse than a random baseline.
Abstract
Top-view perspective denotes a typical way in which humans read and reason over different types of maps, and it is vital for localization and navigation of humans as well as of 'non-human' agents, such as the ones backed by large Vision-Language Models (VLMs). Nonetheless, spatial reasoning capabilities of modern VLMs remain unattested and underexplored. In this work, we thus study their capability to understand and reason over spatial relations from the top view. The focus on top view also enables controlled evaluations at different granularity of spatial reasoning; we clearly disentangle different abilities (e.g., recognizing particular objects versus understanding their relative positions). We introduce the TopViewRS (Top-View Reasoning in Space) dataset, consisting of 11,384 multiple-choice questions with either realistic or semantic top-view map as visual input. We then use it to…
Taxonomy
Topics: Semantic Web and Ontologies · Geographic Information Systems Studies · Constraint Satisfaction and Optimization
Methods: Sparse Evolutionary Training · Focus
