VSP: Assessing the dual challenges of perception and reasoning in spatial planning tasks for VLMs
Qiucheng Wu, Handong Zhao, Michael Saxon, Trung Bui, William Yang Wang, Yang Zhang, Shiyu Chang

TL;DR
This paper introduces VSP, a benchmark to evaluate spatial planning in vision language models, revealing significant deficiencies in perception and reasoning that hinder their performance in spatial tasks.
Contribution
The study presents VSP, a novel benchmark that assesses and analyzes the perception and reasoning capabilities of VLMs in spatial planning tasks.
Findings
VLMs perform poorly on simple spatial planning tasks.
Models show fundamental perception and reasoning deficiencies.
Fine-grained sub-task analysis traces the poor overall performance to these perception and reasoning deficiencies.
Abstract
Vision language models (VLMs) are an exciting emerging class of language models (LMs) that have merged classic LM capabilities with those of image processing systems. However, the ways that these capabilities combine are not always intuitive and warrant direct investigation. One understudied capability in VLMs is visual spatial planning -- the ability to comprehend the spatial arrangements of objects and devise action plans to achieve desired outcomes in visual scenes. In our study, we introduce VSP, a benchmark that 1) evaluates the spatial planning capability of these models in general, and 2) breaks down the visual planning task into finer-grained sub-tasks, including perception and reasoning, and measures the models' capabilities in these sub-tasks. Our evaluation shows that both open-source and private VLMs fail to generate effective plans for even simple spatial planning tasks.…
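To make the idea of "devising an action plan to achieve a desired outcome" concrete, here is a minimal sketch of how a proposed plan could be checked in a toy grid world. Everything in it is an assumption made for illustration and is not taken from the paper: the text grid encoding (`S` start, `G` goal, `#` obstacle, `.` free cell), the move letters `U`/`D`/`L`/`R`, and the function names `find_cell` and `plan_succeeds`. It should not be read as VSP's actual task format or scoring code.

```python
from typing import List, Tuple

# Hypothetical move alphabet (an assumption, not from the paper):
# each letter maps to a (row delta, column delta).
MOVES = {"U": (-1, 0), "D": (1, 0), "L": (0, -1), "R": (0, 1)}

# Hypothetical scene: 'S' start, 'G' goal, '#' obstacle, '.' free cell.
EXAMPLE_GRID = [
    "S.#",
    ".#.",
    "..G",
]


def find_cell(grid: List[str], symbol: str) -> Tuple[int, int]:
    """Return the (row, col) of the first cell containing `symbol`."""
    for r, row in enumerate(grid):
        c = row.find(symbol)
        if c != -1:
            return (r, c)
    raise ValueError(f"symbol {symbol!r} not found in grid")


def plan_succeeds(grid: List[str], plan: str) -> bool:
    """Simulate `plan` from 'S'; succeed only if it ends on 'G'
    without leaving the grid, hitting '#', or using an unknown move."""
    r, c = find_cell(grid, "S")
    goal = find_cell(grid, "G")
    for step in plan:
        delta = MOVES.get(step)
        if delta is None:
            return False  # unknown action letter
        r, c = r + delta[0], c + delta[1]
        if not (0 <= r < len(grid) and 0 <= c < len(grid[r])):
            return False  # walked off the grid
        if grid[r][c] == "#":
            return False  # walked into an obstacle
    return (r, c) == goal


if __name__ == "__main__":
    for candidate in ["DDRR", "RRDD", "DD"]:
        print(candidate, plan_succeeds(EXAMPLE_GRID, candidate))
```

In this toy setting, a perception-style question would ask what a given cell contains, while a reasoning-style question would ask whether a given plan is valid. The checker above covers only the second kind, and only for a scene that has already been transcribed into text.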
Taxonomy
Topics: Constraint Satisfaction and Optimization
