Efficient-VLN: A Training-Efficient Vision-Language Navigation Model
Duo Zheng, Shijia Huang, Yanyang Li, Liwei Wang

TL;DR
Efficient-VLN introduces a memory-optimized, training-efficient vision-language navigation model that reduces computational overhead while achieving state-of-the-art results on benchmark datasets.
Contribution
The paper proposes novel memory mechanisms and a dynamic policy to significantly lower training costs in VLN models without sacrificing performance.
Findings
Achieves a 64.2% success rate (SR) on R2R-CE and 67.0% SR on RxR-CE.
Consumes only 282 GPU hours, greatly reducing training overhead.
Outperforms previous state-of-the-art methods in VLN tasks.
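For readers unfamiliar with the SR metric quoted above, the following is a minimal sketch of how a success rate is typically computed. It assumes the common VLN convention that an episode succeeds when the agent stops within 3 m of the goal; this summary does not state the threshold. The function name `success_rate` and the use of straight-line distance are illustrative simplifications (the benchmarks measure geodesic distance in the simulator).

```python
import math

def success_rate(final_positions, goals, threshold_m=3.0):
    """Fraction of episodes whose stopping point is within threshold_m of the goal.

    Assumption: straight-line (Euclidean) distance stands in for the
    geodesic distance that navigation benchmarks actually use.
    """
    if len(final_positions) != len(goals):
        raise ValueError("need one goal per episode")
    if not goals:
        return 0.0
    hits = sum(
        1
        for stop, goal in zip(final_positions, goals)
        if math.dist(stop, goal) <= threshold_m
    )
    return hits / len(goals)

# Two toy episodes: the first stops 1 m from its goal, the second 10 m away.
episodes = [(0.0, 0.0), (10.0, 0.0)]
targets = [(1.0, 0.0), (0.0, 0.0)]
print(success_rate(episodes, targets))
```

Under this convention, a reported 64.2% SR means roughly 64 of every 100 evaluation episodes ended inside the success radius.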
Abstract
Multimodal large language models (MLLMs) have shown promising potential in Vision-Language Navigation (VLN). However, their practical development is severely hindered by the substantial training overhead. We recognize two key issues that contribute to the overhead: (1) the quadratic computational burden from processing long-horizon historical observations as massive sequences of tokens, and (2) the exploration-efficiency trade-off in DAgger, i.e., a data aggregation process of collecting agent-explored trajectories. While more exploration yields effective error-recovery trajectories for handling test-time distribution shifts, it comes at the cost of longer trajectory lengths for both training and inference. To address these challenges, we propose Efficient-VLN, a training-efficient VLN model. Specifically, to mitigate the token processing burden, we design two efficient memory…
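The two overhead sources named in the abstract can be made concrete with a toy sketch. This is not the paper's method: `attention_pairs` simply counts token pairs under full self-attention, and `rollout` is a classic DAgger-style beta-mixing rollout in a hypothetical 1-D corridor, where `beta` is the per-step probability of following the expert. All names, the corridor environment, and the `learner_error` value are assumptions made for illustration.

```python
import random

def attention_pairs(num_frames, tokens_per_frame):
    """Token pairs scored by full self-attention over the whole history."""
    n = num_frames * tokens_per_frame
    return n * n  # quadratic in the number of history tokens

def rollout(beta, rng, goal=10, max_steps=200, learner_error=0.3):
    """Collect (state, expert_action) pairs in a toy 1-D corridor.

    The expert always steps toward the goal (+1). The hypothetical learner
    steps the wrong way with probability learner_error. At each step the
    executed action comes from the expert with probability beta, otherwise
    from the learner; the expert's action is always stored as the label.
    """
    pos, data = 0, []
    for _ in range(max_steps):
        if pos == goal:
            break
        expert_action = 1
        learner_action = -1 if rng.random() < learner_error else 1
        action = expert_action if rng.random() < beta else learner_action
        data.append((pos, expert_action))
        pos += action
    return data

# Doubling the number of history frames quadruples the attention cost.
print(attention_pairs(8, 64), attention_pairs(16, 64))

# Less expert control (smaller beta) means more exploration and longer episodes.
for beta in (1.0, 0.5, 0.0):
    lengths = [len(rollout(beta, random.Random(seed))) for seed in range(100)]
    print(beta, sum(lengths) / len(lengths))
```

With `beta = 1.0` every episode follows the shortest path, so no error-recovery states are collected. Lowering `beta` visits off-path states that carry recovery labels, but the average episode grows longer, which is the exploration-efficiency trade-off the abstract describes.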
Taxonomy
Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Advanced Neural Network Applications
