Efficient-VLN: A Training-Efficient Vision-Language Navigation Model

Duo Zheng; Shijia Huang; Yanyang Li; Liwei Wang

arXiv:2512.10310 · cs.CV · December 12, 2025

Open Access

TL;DR

Efficient-VLN is a memory-optimized, training-efficient vision-language navigation model that reduces computational overhead while achieving state-of-the-art results on the R2R-CE and RxR-CE benchmarks.

Contribution

The paper proposes novel memory mechanisms and a dynamic policy to significantly lower training costs in VLN models without sacrificing performance.

Findings

01. Achieves a 64.2% success rate (SR) on R2R-CE and 67.0% SR on RxR-CE.

02. Consumes only 282 GPU hours for training, greatly reducing training overhead.

03. Outperforms previous state-of-the-art methods on VLN tasks.

Abstract

Multimodal large language models (MLLMs) have shown promising potential in Vision-Language Navigation (VLN). However, their practical development is severely hindered by the substantial training overhead. We recognize two key issues that contribute to the overhead: (1) the quadratic computational burden from processing long-horizon historical observations as massive sequences of tokens, and (2) the exploration-efficiency trade-off in DAgger, i.e., a data aggregation process of collecting agent-explored trajectories. While more exploration yields effective error-recovery trajectories for handling test-time distribution shifts, it comes at the cost of longer trajectory lengths for both training and inference. To address these challenges, we propose Efficient-VLN, a training-efficient VLN model. Specifically, to mitigate the token processing burden, we design two efficient memory…
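The first issue named in the abstract, the quadratic cost of attending over long observation histories, can be illustrated with a small back-of-the-envelope sketch. This is not the paper's method or measurement: the function name `attention_pair_count` and the value of 196 tokens per frame are assumptions chosen only for illustration, and the count covers plain full self-attention without any of the memory mechanisms the paper proposes.

```python
def attention_pair_count(num_frames: int, tokens_per_frame: int) -> int:
    """Count query-key pairs in full self-attention over all history tokens.

    Hypothetical helper for illustration; it ignores instruction tokens,
    attention heads, and layers, which only add constant factors.
    """
    seq_len = num_frames * tokens_per_frame
    return seq_len * seq_len


# Assumed value (e.g. a 14x14 patch grid per frame); not taken from the paper.
TOKENS_PER_FRAME = 196

for frames in (8, 16, 32):
    pairs = attention_pair_count(frames, TOKENS_PER_FRAME)
    tokens = frames * TOKENS_PER_FRAME
    print(f"{frames:>3} frames -> {tokens:>5} tokens, {pairs:,} query-key pairs")
```

Doubling the number of history frames quadruples the pair count, which is why long-horizon trajectories become expensive when every past observation is kept as a full sequence of tokens.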

Taxonomy

Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Advanced Neural Network Applications