TopViewRS: Vision-Language Models as Top-View Spatial Reasoners

Chengzu Li; Caiqi Zhang; Han Zhou; Nigel Collier; Anna Korhonen; Ivan Vulić

arXiv:2406.02537 · cs.CL · June 5, 2024

TL;DR

This paper evaluates the spatial reasoning abilities of vision-language models from a top-view perspective, revealing significant gaps compared to human performance and highlighting the need for improved reasoning capabilities.

Contribution

Introduces the TopViewRS dataset and systematically assesses VLMs' top-view spatial reasoning, exposing their limitations and guiding future research.

Findings

01. VLMs perform over 50% worse than humans on spatial reasoning tasks.

02. Chain-of-Thought reasoning improves VLM performance by 5.82% on average.

03. In some cases, VLMs perform worse than random baselines.
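
To make the comparison with a random baseline concrete, here is a minimal sketch of how accuracy on multiple-choice questions can be set against chance level. The record layout (`options`, `answer`, `prediction`) and the sample items are hypothetical; they are not taken from the TopViewRS dataset or its evaluation code.

```python
from dataclasses import dataclass


@dataclass
class Item:
    # Hypothetical record layout, not the dataset's actual schema.
    options: list[str]  # candidate answers shown to the model
    answer: str         # gold answer
    prediction: str     # answer chosen by the model


def accuracy(items: list[Item]) -> float:
    """Fraction of items where the prediction matches the gold answer."""
    return sum(i.prediction == i.answer for i in items) / len(items)


def random_baseline(items: list[Item]) -> float:
    """Expected accuracy of uniform random guessing over each item's options."""
    return sum(1 / len(i.options) for i in items) / len(items)


# Made-up examples for illustration only.
items = [
    Item(["kitchen", "bedroom", "bathroom", "hallway"], "kitchen", "kitchen"),
    Item(["left", "right", "above", "below"], "left", "right"),
    Item(["sofa", "table", "bed", "sink"], "bed", "sink"),
    Item(["north", "south", "east", "west"], "east", "west"),
    Item(["one", "two", "three", "four"], "two", "three"),
]

model_acc = accuracy(items)
chance = random_baseline(items)
print(f"model accuracy: {model_acc:.2f}, random baseline: {chance:.2f}")
print("below chance" if model_acc < chance else "at or above chance")
```

Computing the baseline per item, rather than assuming a fixed number of options, keeps the comparison valid if questions differ in how many choices they offer.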

Abstract

Top-view perspective denotes a typical way in which humans read and reason over different types of maps, and it is vital for localization and navigation of humans as well as of 'non-human' agents, such as the ones backed by large Vision-Language Models (VLMs). Nonetheless, spatial reasoning capabilities of modern VLMs remain unattested and underexplored. In this work, we thus study their capability to understand and reason over spatial relations from the top view. The focus on top view also enables controlled evaluations at different granularity of spatial reasoning; we clearly disentangle different abilities (e.g., recognizing particular objects versus understanding their relative positions). We introduce the TopViewRS (Top-View Reasoning in Space) dataset, consisting of 11,384 multiple-choice questions with either realistic or semantic top-view map as visual input. We then use it to…

Code & Models

Repositories

cambridgeltl/topviewrs
Official

Datasets

chengzu/topviewrs
dataset · 49 dl

Videos

TopViewRS: Vision-Language Models as Top-View Spatial Reasoners

Taxonomy

Topics: Semantic Web and Ontologies · Geographic Information Systems Studies · Constraint Satisfaction and Optimization

Methods: Sparse Evolutionary Training · Focus