VSP: Assessing the dual challenges of perception and reasoning in spatial planning tasks for VLMs

Qiucheng Wu; Handong Zhao; Michael Saxon; Trung Bui; William Yang Wang; Yang Zhang; Shiyu Chang

arXiv:2407.01863·cs.CL·July 3, 2024

TL;DR

This paper introduces VSP, a benchmark to evaluate spatial planning in vision language models, revealing significant deficiencies in perception and reasoning that hinder their performance in spatial tasks.

Contribution

The study presents VSP, a novel benchmark that assesses and analyzes the perception and reasoning capabilities of VLMs in spatial planning tasks.

Findings

1. VLMs perform poorly on simple spatial planning tasks.
2. Models show fundamental perception and reasoning deficiencies.
3. Fine-grained analysis explains poor overall performance.

Abstract

Vision language models (VLMs) are an exciting emerging class of language models (LMs) that have merged classic LM capabilities with those of image processing systems. However, the ways that these capabilities combine are not always intuitive and warrant direct investigation. One understudied capability in VLMs is visual spatial planning: the ability to comprehend the spatial arrangements of objects and devise action plans to achieve desired outcomes in visual scenes. In our study, we introduce VSP, a benchmark that 1) evaluates the spatial planning capability of these models in general, and 2) breaks down the visual planning task into finer-grained sub-tasks, including perception and reasoning, and measures the models' capabilities in these sub-tasks. Our evaluation shows that both open-source and private VLMs fail to generate effective plans for even simple spatial planning tasks.…

Code & Models

Repositories

ucsb-nlp-chang/visual-spatial-planning
Official

Taxonomy

Topics: Constraint Satisfaction and Optimization