End-to-End Navigation with Vision Language Models: Transforming Spatial Reasoning into Question-Answering

Dylan Goetting; Himanshu Gaurav Singh; Antonio Loquercio

arXiv:2411.05755·cs.RO·November 11, 2024

TL;DR

VLMnav demonstrates that a vision-language model can be used directly as an end-to-end navigation policy in a zero-shot manner, with no separate perception, planning, and control modules and no fine-tuning.

Contribution

This work introduces VLMnav, a framework that uses a VLM to select navigation actions directly and zero-shot, without task-specific training or navigation data, which makes it applicable to any downstream navigation task.

Findings

01. VLMnav outperforms baseline prompting methods in navigation tasks.

02. Zero-shot VLMs can effectively control navigation without fine-tuning.

03. A design analysis identifies the design decisions that most affect performance.

Abstract

We present VLMnav, an embodied framework to transform a Vision-Language Model (VLM) into an end-to-end navigation policy. In contrast to prior work, we do not rely on a separation between perception, planning, and control; instead, we use a VLM to directly select actions in one step. Surprisingly, we find that a VLM can be used as an end-to-end policy zero-shot, i.e., without any fine-tuning or exposure to navigation data. This makes our approach open-ended and generalizable to any downstream navigation task. We run an extensive study to evaluate the performance of our approach in comparison to baseline prompting methods. In addition, we perform a design analysis to understand the most impactful design decisions. Visual examples and code for our project can be found at https://jirl-upenn.github.io/VLMnav/
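The idea of selecting an action "in one step" by question answering can be illustrated with a minimal sketch. It is not taken from the VLMnav code: the `Action` type, the prompt wording, the `{"action": <number>}` reply format, and the `vlm` callable are all hypothetical and chosen only for illustration. The sketch turns a set of candidate motions into a numbered multiple-choice question, sends it with the current image to a VLM, and maps the reply back to a choice.

```python
import re
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass(frozen=True)
class Action:
    """Hypothetical candidate motion: turn by `turn_deg`, then move `forward_m` metres."""
    turn_deg: float
    forward_m: float


def build_question(goal: str, actions: Sequence[Action]) -> str:
    """Phrase action selection as a numbered multiple-choice question (assumed wording)."""
    lines = [
        f"You are a robot navigating to: {goal}.",
        "Which numbered action should you take next?",
    ]
    for i, a in enumerate(actions, start=1):
        lines.append(f"{i}. turn {a.turn_deg:+.0f} degrees, move {a.forward_m:.1f} m")
    lines.append("0. stop (the goal has been reached)")
    lines.append('Answer in the form {"action": <number>}.')
    return "\n".join(lines)


def parse_choice(reply: str, n_actions: int) -> Optional[int]:
    """Return the chosen index (0 means stop), or None if missing or out of range."""
    match = re.search(r'"action"\s*:\s*(\d+)', reply)
    if match is None:
        return None
    idx = int(match.group(1))
    return idx if 0 <= idx <= n_actions else None


def select_action(
    vlm: Callable[[bytes, str], str],  # hypothetical: (image bytes, question) -> reply text
    image: bytes,
    goal: str,
    actions: Sequence[Action],
) -> int:
    """One policy step: a single VLM query, with no separate planner or controller."""
    reply = vlm(image, build_question(goal, actions))
    choice = parse_choice(reply, len(actions))
    if choice is None:
        raise ValueError(f"could not parse an action from reply: {reply!r}")
    return choice
```

Because `vlm` is just a callable, any model client can be wrapped to fit it, and a stub such as `lambda image, question: '{"action": 1}'` is enough to exercise the loop without a model.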

Code & Models

Repositories

Jirl-upenn/VLMnav (official)

Taxonomy

Topics: Geographic Information Systems Studies · Semantic Web and Ontologies · Constraint Satisfaction and Optimization