End-to-End Navigation with Vision Language Models: Transforming Spatial Reasoning into Question-Answering
Dylan Goetting, Himanshu Gaurav Singh, Antonio Loquercio

TL;DR
VLMnav shows that a vision-language model can serve directly as an end-to-end navigation policy, zero-shot, with no separate perception or planning modules and no fine-tuning.
Contribution
This work introduces VLMnav, a framework that uses a VLM for direct, zero-shot navigation without task-specific training or navigation data, which makes it applicable to any downstream navigation task.
Findings
VLMnav outperforms baseline prompting methods in navigation tasks.
Zero-shot VLMs can effectively control navigation without fine-tuning.
Design analysis reveals key factors impacting performance.
Abstract
We present VLMnav, an embodied framework to transform a Vision-Language Model (VLM) into an end-to-end navigation policy. In contrast to prior work, we do not rely on a separation between perception, planning, and control; instead, we use a VLM to directly select actions in one step. Surprisingly, we find that a VLM can be used as an end-to-end policy zero-shot, i.e., without any fine-tuning or exposure to navigation data. This makes our approach open-ended and generalizable to any downstream navigation task. We run an extensive study to evaluate the performance of our approach in comparison to baseline prompting methods. In addition, we perform a design analysis to understand the most impactful design decisions. Visual examples and code for our project can be found at https://jirl-upenn.github.io/VLMnav/
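The idea of having a VLM "directly select actions in one step" can be illustrated with a minimal sketch. This is not the authors' implementation: the action list, the prompt wording, and the names `build_prompt`, `parse_choice`, `select_action`, and `fake_vlm` are all hypothetical, and the VLM is modeled as any callable that takes an image and a text prompt and returns text. The sketch only shows how one navigation step can be phrased as a multiple-choice question whose answer is mapped back to an action.

```python
import re
from typing import Callable, Optional, Sequence

# Hypothetical discrete action set; the paper's real action space may differ.
ACTIONS = ["move forward", "turn left", "turn right", "stop"]


def build_prompt(goal: str, actions: Sequence[str]) -> str:
    """Phrase one navigation step as a multiple-choice question."""
    options = "\n".join(f"{i}: {a}" for i, a in enumerate(actions))
    return (
        f"You are a robot navigating toward this goal: {goal}.\n"
        "Given the current camera image, which action should you take next?\n"
        f"{options}\n"
        "Answer with the number of one action."
    )


def parse_choice(answer: str, num_actions: int) -> Optional[int]:
    """Return the first valid action index found in the reply, or None."""
    for token in re.findall(r"\d+", answer):
        idx = int(token)
        if 0 <= idx < num_actions:
            return idx
    return None


def select_action(
    vlm: Callable[[bytes, str], str],
    image: bytes,
    goal: str,
    actions: Sequence[str] = ACTIONS,
    fallback: int = 0,
) -> str:
    """Ask the VLM once and map its answer to an action (no training involved)."""
    answer = vlm(image, build_prompt(goal, actions))
    choice = parse_choice(answer, len(actions))
    # If the reply contains no usable index, fall back to a default action.
    return actions[fallback if choice is None else choice]


def fake_vlm(image: bytes, prompt: str) -> str:
    """Stand-in for a real VLM call, used only so the sketch runs offline."""
    return "I would pick action 2."


if __name__ == "__main__":
    print(select_action(fake_vlm, b"", "find the couch"))
```

In practice `fake_vlm` would be replaced by a call to an actual vision-language model. Keeping the model behind a plain callable means the same loop works with any model that accepts an image and a question.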
Taxonomy
Topics: Geographic Information Systems Studies · Semantic Web and Ontologies · Constraint Satisfaction and Optimization
