VLN-R1: Vision-Language Navigation via Reinforcement Fine-Tuning

Zhangyang Qi; Zhixiong Zhang; Yizhou Yu; Jiaqi Wang; Hengshuang Zhao

arXiv:2506.17221 · cs.CV · June 26, 2025


TL;DR

VLN-R1 is an end-to-end framework that uses large vision-language models to translate egocentric video streams directly into continuous navigation actions, moving beyond the discrete topological graphs that earlier language-model-based navigation systems rely on.

Contribution

The paper presents VLN-R1, a novel approach that leverages LVLMs and reinforcement fine-tuning for more effective, data-efficient vision-language navigation in real-world environments.

Findings

01. Achieves strong performance on the VLN-CE benchmark

02. Demonstrates that LVLMs can effectively drive embodied navigation

03. Employs two-stage training with supervised and reinforcement fine-tuning

Abstract

Vision-Language Navigation (VLN) is a core challenge in embodied AI, requiring agents to navigate real-world environments using natural language instructions. Current language model-based navigation systems operate on discrete topological graphs, limiting path planning to predefined node connections. We propose VLN-R1, an end-to-end framework that leverages Large Vision-Language Models (LVLM) to directly translate egocentric video streams into continuous navigation actions, adopting GRPO-based training inspired by DeepSeek-R1. To enable effective training, we first construct the VLN-Ego dataset using a 3D simulator, Habitat, and propose Long-Short Memory Sampling to balance historical and current observations. While large language models can supervise complete textual instructions, they lack fine-grained action-level control. Our framework employs a two-stage training approach: a)…
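The abstract names GRPO-based training but the listing cuts off before any detail. As a rough illustration of the group-relative idea behind GRPO in general (not the authors' implementation), the sketch below normalizes the rewards of several responses sampled for the same prompt against that group's own mean and standard deviation. The function name, the `eps` constant, the use of the population standard deviation, and the reward values are all assumptions made for this example.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Score each sampled response relative to its own group.

    rewards: scalar rewards for responses sampled from the same prompt.
    eps: small constant (an assumption) that avoids division by zero
         when every reward in the group is identical.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumed choice
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four action sequences sampled for one instruction.
rewards = [0.2, 0.8, 0.5, 0.5]
advantages = group_relative_advantages(rewards)
```

Responses that score above the group average receive a positive advantage and those below receive a negative one, so no separate learned value model is needed to provide a baseline.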

Code & Models

Repositories

Qi-Zhangyang/GPT4Scene
pytorch

Taxonomy

Topics: Natural Language Processing Techniques · Multimodal Machine Learning Applications · Speech and dialogue systems

Methods: ALIGN