Learning to Retrieve Navigable Candidates for Efficient Vision-and-Language Navigation
Shutian Gu, Chengkai Huang, Ruoyu Wang, Lina Yao

TL;DR
This paper introduces a retrieval-augmented framework that improves the efficiency and stability of LLM-based vision-and-language navigation, using retrieval modules to provide better guidance and to prune navigable candidates.
Contribution
It proposes a modular retrieval approach that operates at the episode level and the step level, improving decision-making without fine-tuning the LLM and yielding significant performance gains.
Findings
Retrieval improves success rates on the R2R benchmark
Retrieval modules reduce prompt complexity and ambiguity
The framework enhances navigation efficiency and stability
Abstract
Vision-and-Language Navigation (VLN) requires an agent to follow natural-language instructions and navigate through previously unseen environments. Recent approaches increasingly employ large language models (LLMs) as high-level navigators due to their flexibility and reasoning capability. However, prompt-based LLM navigation often suffers from inefficient decision-making, as the model must repeatedly interpret instructions from scratch and reason over noisy and verbose navigable candidates at each step. In this paper, we propose a retrieval-augmented framework to improve the efficiency and stability of LLM-based VLN without modifying or fine-tuning the underlying language model. Our approach introduces retrieval at two complementary levels. At the episode level, an instruction-level embedding retriever selects semantically similar successful navigation trajectories as in-context…
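The episode-level step described above, selecting past successful trajectories whose instructions resemble the current one, can be sketched as a nearest-neighbour lookup over instruction embeddings. This is only an illustrative sketch, not the authors' implementation: the abstract does not state the embedding model or the similarity measure, so cosine similarity, the `(embedding, trajectory_text)` memory layout, and the function names below are all assumptions.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_demonstrations(
    query_embedding: Sequence[float],
    memory: List[Tuple[Sequence[float], str]],
    k: int = 2,
) -> List[str]:
    """Return the k stored trajectories whose instruction embeddings are most similar to the query.

    `memory` is a hypothetical store of (instruction_embedding, trajectory_text)
    pairs collected from successful episodes; the layout is an assumption.
    """
    scored = [(cosine_similarity(query_embedding, emb), traj) for emb, traj in memory]
    # Highest similarity first; sort is stable, so ties keep insertion order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [traj for _, traj in scored[:k]]


# Toy usage with hand-made 2-D "embeddings" (real embeddings would come from a text encoder).
toy_memory = [
    ([1.0, 0.0], "trajectory A"),
    ([0.0, 1.0], "trajectory B"),
    ([0.9, 0.1], "trajectory C"),
]
examples = retrieve_demonstrations([1.0, 0.05], toy_memory, k=2)
```

The retrieved trajectories would then be placed in the prompt as in-context examples, which leaves the LLM itself unchanged.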
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Natural Language Processing Techniques
