VLN-R1: Vision-Language Navigation via Reinforcement Fine-Tuning
Zhangyang Qi, Zhixiong Zhang, Yizhou Yu, Jiaqi Wang, Hengshuang Zhao

TL;DR
VLN-R1 is an end-to-end framework that uses large vision-language models to translate egocentric video streams directly into continuous navigation actions, rather than planning over the discrete topological graphs that earlier language-model-based navigation systems rely on.
Contribution
The paper presents VLN-R1, a novel approach that leverages LVLMs and reinforcement fine-tuning for more effective, data-efficient vision-language navigation in real-world environments.
Findings
Achieves strong performance on the VLN-CE benchmark.
Demonstrates that LVLMs can effectively drive embodied navigation.
Employs two-stage training: supervised fine-tuning followed by reinforcement fine-tuning.
Abstract
Vision-Language Navigation (VLN) is a core challenge in embodied AI, requiring agents to navigate real-world environments using natural language instructions. Current language model-based navigation systems operate on discrete topological graphs, limiting path planning to predefined node connections. We propose VLN-R1, an end-to-end framework that leverages Large Vision-Language Models (LVLM) to directly translate egocentric video streams into continuous navigation actions, adopting GRPO-based training inspired by DeepSeek-R1. To enable effective training, we first construct the VLN-Ego dataset using a 3D simulator, Habitat, and propose Long-Short Memory Sampling to balance historical and current observations. While large language models can supervise complete textual instructions, they lack fine-grained action-level control. Our framework employs a two-stage training approach: a)…
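The abstract names GRPO-based training but the excerpt does not show how it is applied. As a rough illustration only, the core idea of GRPO as generally described is to score each sampled output relative to the other outputs sampled for the same prompt, so no separate value network is needed. The sketch below shows just that normalization step. The function name, the reward values, and the use of the population standard deviation are assumptions made for illustration; they are not taken from the VLN-R1 paper or its code.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the mean and spread of its own group.

    `rewards` holds one scalar reward per sampled output for a single prompt
    (here: one navigation instruction). `eps` avoids division by zero when
    every sample in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population standard deviation
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four action sequences sampled for one instruction.
rewards = [0.2, 0.8, 0.5, 0.5]
advantages = group_relative_advantages(rewards)
```

Samples that score above the group average get a positive advantage and are reinforced; those below get a negative one. When every sample earns the same reward, all advantages are zero and the group contributes no learning signal.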
Taxonomy
Topics: Natural Language Processing Techniques · Multimodal Machine Learning Applications · Speech and dialogue systems
Methods: ALIGN
