Learning to Retrieve Navigable Candidates for Efficient Vision-and-Language Navigation

Shutian Gu; Chengkai Huang; Ruoyu Wang; Lina Yao

arXiv:2602.15724·cs.CV·February 18, 2026


Open Access

TL;DR

This paper introduces a retrieval-augmented framework to enhance the efficiency and stability of large language model-based vision-and-language navigation by using retrieval modules for better guidance and candidate pruning.

Contribution

The paper proposes a modular retrieval approach that operates at both the episode level and the step level. It improves decision-making without fine-tuning the LLM and reports improved success rates on the R2R benchmark.
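To make the step-level idea concrete, here is a minimal sketch of candidate pruning before prompting. It is an illustration only, not the paper's method: the word-overlap score, the `prune_candidates` and `build_prompt` helpers, and the candidate descriptions are all hypothetical stand-ins, and the actual retriever may well be a learned model.

```python
def prune_candidates(instruction, candidates, keep=2):
    """Keep the `keep` candidates whose descriptions share the most words
    with the instruction (an assumed heuristic, not the paper's scorer)."""
    words = set(instruction.lower().split())

    def overlap(candidate):
        return len(words & set(candidate["description"].lower().split()))

    # sorted() is stable, so ties keep their original order.
    ranked = sorted(candidates, key=overlap, reverse=True)
    return ranked[:keep]


def build_prompt(instruction, candidates):
    """Assemble a short prompt that lists only the surviving candidates."""
    lines = [f"Instruction: {instruction}", "Candidates:"]
    lines += [f"- {c['id']}: {c['description']}" for c in candidates]
    return "\n".join(lines)


# Hypothetical navigable candidates at one step.
candidates = [
    {"id": "A", "description": "a hallway leading to the kitchen"},
    {"id": "B", "description": "a bathroom with a mirror"},
    {"id": "C", "description": "stairs going down"},
    {"id": "D", "description": "the kitchen door next to a sofa"},
]
instruction = "walk past the sofa and stop at the kitchen door"

kept = prune_candidates(instruction, candidates, keep=2)
prompt = build_prompt(instruction, kept)
print(prompt)
```

The point of the sketch is only that the LLM sees a shorter candidate list at each step; how candidates are scored is left as an assumption here.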

Findings

01 Improved success rates on the R2R benchmark

02 Retrieval modules reduce prompt complexity and ambiguity

03 Improved navigation efficiency and stability

Abstract

Vision-and-Language Navigation (VLN) requires an agent to follow natural-language instructions and navigate through previously unseen environments. Recent approaches increasingly employ large language models (LLMs) as high-level navigators due to their flexibility and reasoning capability. However, prompt-based LLM navigation often suffers from inefficient decision-making, as the model must repeatedly interpret instructions from scratch and reason over noisy and verbose navigable candidates at each step. In this paper, we propose a retrieval-augmented framework to improve the efficiency and stability of LLM-based VLN without modifying or fine-tuning the underlying language model. Our approach introduces retrieval at two complementary levels. At the episode level, an instruction-level embedding retriever selects semantically similar successful navigation trajectories as in-context…
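The episode-level retrieval described above can be sketched as a nearest-neighbour lookup over stored instructions. This is a toy illustration under stated assumptions: the bag-of-words `embed` function stands in for whatever instruction encoder the paper uses, and the `memory` of successful trajectories, the action names, and `retrieve_exemplars` are hypothetical.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words embedding (a stand-in for a real sentence encoder)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve_exemplars(instruction, memory, k=2):
    """Return the k stored (instruction, trajectory) pairs most similar to the query."""
    query = embed(instruction)
    ranked = sorted(memory, key=lambda item: cosine(query, embed(item[0])), reverse=True)
    return ranked[:k]


# Hypothetical store of successful episodes: (instruction, action sequence).
memory = [
    ("walk past the sofa and stop at the kitchen door", ["forward", "left", "stop"]),
    ("go up the stairs and enter the bedroom", ["forward", "forward", "right", "stop"]),
    ("walk past the table and stop at the kitchen sink", ["forward", "right", "stop"]),
]

exemplars = retrieve_exemplars("walk past the chair and stop at the kitchen", memory, k=2)
for text, trajectory in exemplars:
    print(text, "->", trajectory)
```

The retrieved pairs would then be placed in the LLM prompt as in-context examples; the frozen LLM itself is not touched, which matches the abstract's claim that no fine-tuning is required.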

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Natural Language Processing Techniques