Following Route Instructions using Large Vision-Language Models: A Comparison between Low-level and Panoramic Action Spaces
Vebjørn Haug Kåsene, Pierre Lison

TL;DR
This paper evaluates off-the-shelf large vision-language models for navigation tasks, comparing low-level and panoramic action spaces, and finds they can perform VLN but still underperform specialized models.
Contribution
It demonstrates that off-the-shelf LVLMs can be adapted for VLN tasks across different action spaces without architectural changes.
Findings
Achieved 41% success rate on R2R test set
Off-the-shelf LVLMs can learn VLN tasks
Performance lags behind specialized models
Abstract
Vision-and-Language Navigation (VLN) refers to the task of enabling autonomous robots to navigate unfamiliar environments by following natural language instructions. While recent Large Vision-Language Models (LVLMs) have shown promise in this task, most current VLN systems rely on models specifically designed and optimized for navigation, leaving the potential of off-the-shelf LVLMs underexplored. Furthermore, while older VLN approaches used low-level action spaces with egocentric views and atomic actions (such as "turn left" or "move forward"), newer models tend to favor panoramic action spaces with discrete navigable viewpoints. This paper investigates (1) whether off-the-shelf LVLMs (fine-tuned without architectural modifications or simulator-based training) can effectively support VLN tasks and (2) whether such models can support both low-level and panoramic action paradigms. To…
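To make the distinction between the two action spaces concrete, the following is a minimal illustrative sketch in Python. It is not the paper's implementation: the names `LowLevelAction`, `Viewpoint`, `low_level_choices`, and `panoramic_choices` are hypothetical, as are the exact action strings beyond the "turn left" and "move forward" examples quoted in the abstract, and the presence of an explicit "stop" option in both spaces is an assumption.

```python
from __future__ import annotations

from dataclasses import dataclass
from enum import Enum


class LowLevelAction(Enum):
    """Atomic egocentric actions (illustrative names, not taken from the paper)."""
    TURN_LEFT = "turn left"
    TURN_RIGHT = "turn right"
    MOVE_FORWARD = "move forward"
    STOP = "stop"  # assumption: the agent can end the episode explicitly


@dataclass(frozen=True)
class Viewpoint:
    """A discrete navigable location visible from the agent's panorama (hypothetical fields)."""
    viewpoint_id: str
    heading_deg: float  # direction of the viewpoint relative to the agent


def low_level_choices() -> list[str]:
    """Low-level space: the option set is fixed and identical at every step."""
    return [action.value for action in LowLevelAction]


def panoramic_choices(candidates: list[Viewpoint]) -> list[str]:
    """Panoramic space: the option set depends on which viewpoints are currently navigable."""
    return [f"go to {v.viewpoint_id}" for v in candidates] + ["stop"]
```

Under these assumptions, a low-level agent always chooses among the same four options, whereas a panoramic agent's options change with its surroundings: `panoramic_choices([Viewpoint("a", 0.0), Viewpoint("b", 90.0)])` yields one "go to" option per candidate plus "stop".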
