Following Route Instructions using Large Vision-Language Models: A Comparison between Low-level and Panoramic Action Spaces

Vebjørn Haug Kåsene; Pierre Lison

arXiv:2508.02917 · cs.CV · August 6, 2025

TL;DR

This paper evaluates off-the-shelf large vision-language models (LVLMs) on Vision-and-Language Navigation (VLN), comparing low-level and panoramic action spaces. It finds that such models can perform VLN but still underperform models designed specifically for the task.

Contribution

The paper demonstrates that off-the-shelf LVLMs can be fine-tuned for VLN in both low-level and panoramic action spaces without architectural changes.

Findings

01. Achieved 41% success rate on R2R test set

02. Off-the-shelf LVLMs can learn VLN tasks

03. Performance lags behind specialized models

Abstract

Vision-and-Language Navigation (VLN) refers to the task of enabling autonomous robots to navigate unfamiliar environments by following natural language instructions. While recent Large Vision-Language Models (LVLMs) have shown promise in this task, most current VLN systems rely on models specifically designed and optimized for navigation, leaving the potential of off-the-shelf LVLMs underexplored. Furthermore, while older VLN approaches used low-level action spaces with egocentric views and atomic actions (such as "turn left" or "move forward"), newer models tend to favor panoramic action spaces with discrete navigable viewpoints. This paper investigates (1) whether off-the-shelf LVLMs (fine-tuned without architectural modifications or simulator-based training) can effectively support VLN tasks and (2) whether such models can support both low-level and panoramic action paradigms. To…
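
To make the distinction between the two action paradigms concrete, here is a minimal Python sketch of how the options offered to a model might differ at each step. It is an illustration only, not the paper's implementation: the class and function names, the `turn right` and `stop` actions, and the `Viewpoint` fields are all assumptions introduced for this example. Only "turn left", "move forward", and the idea of discrete navigable viewpoints come from the abstract.

```python
from dataclasses import dataclass
from enum import Enum


class LowLevelAction(Enum):
    """Atomic egocentric actions. Only 'turn left' and 'move forward'
    are named in the abstract; the other two are assumed for illustration."""
    TURN_LEFT = "turn left"
    TURN_RIGHT = "turn right"      # assumption
    MOVE_FORWARD = "move forward"
    STOP = "stop"                  # assumption


@dataclass(frozen=True)
class Viewpoint:
    """A navigable viewpoint visible from the current panorama (hypothetical fields)."""
    viewpoint_id: str
    heading_deg: float  # direction of the candidate relative to the agent


def low_level_choices() -> list[str]:
    """Low-level action space: the same fixed set of atomic actions at every step."""
    return [action.value for action in LowLevelAction]


def panoramic_choices(candidates: list[Viewpoint]) -> list[str]:
    """Panoramic action space: one option per navigable viewpoint, plus stopping."""
    options = [
        f"go to {v.viewpoint_id} (heading {v.heading_deg:.0f} deg)"
        for v in candidates
    ]
    return options + ["stop"]
```

In this sketch the low-level option list never changes, while the panoramic option list depends on which viewpoints are reachable from the agent's current position.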
