Discuss Before Moving: Visual Language Navigation via Multi-expert Discussions
Yuxing Long, Xiaoqi Li, Wenzhe Cai, Hao Dong

TL;DR
This paper introduces DiscussNav, a zero-shot visual language navigation framework where large models act as domain experts and actively discuss to improve navigation accuracy, surpassing existing models and demonstrating real-robot advantages.
Contribution
The paper proposes a novel multi-expert discussion framework for VLN, enabling models to actively consult domain experts, which enhances navigation performance over traditional single-round methods.
Findings
DiscussNav outperforms leading zero-shot VLN models on the R2R benchmark.
Active discussions help correct errors and improve environment understanding.
Real-robot experiments confirm the practical advantages of the approach.
Abstract
Visual language navigation (VLN) is an embodied task demanding a wide range of skills encompassing understanding, perception, and planning. For such a multifaceted challenge, previous VLN methods rely entirely on one model's own thinking to make predictions within one round. However, existing models, even the most advanced large language model GPT4, still struggle to handle multiple tasks through single-round self-thinking. In this work, drawing inspiration from the expert consultation meeting, we introduce a novel zero-shot VLN framework. Within this framework, large models possessing distinct abilities serve as domain experts. Our proposed navigation agent, namely DiscussNav, can actively discuss with these experts to collect essential information before moving at every step. These discussions cover critical navigation subtasks like instruction understanding, environment…
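The "discuss before moving" idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the class `DiscussionAgent`, the `Expert` type, the stub experts, and the keyword-counting decision rule are all hypothetical stand-ins chosen so the example runs without any large model. It only shows the control flow of querying several experts at a step and then choosing an action from their answers.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical: an "expert" is any callable that answers a text query.
# In a real system this would wrap a large model; here it is a plain function.
Expert = Callable[[str], str]


@dataclass
class DiscussionAgent:
    """Toy agent that consults every expert before choosing a move."""
    experts: Dict[str, Expert]

    def discuss(self, instruction: str, observation: str) -> Dict[str, str]:
        # Ask each expert once and collect the answers keyed by subtask name.
        query = f"Instruction: {instruction}\nObservation: {observation}"
        return {name: expert(query) for name, expert in self.experts.items()}

    def step(self, instruction: str, observation: str,
             candidates: List[str]) -> str:
        notes = self.discuss(instruction, observation)
        # Placeholder decision rule (an assumption, not the paper's method):
        # pick the candidate direction mentioned by the most experts.
        scores = {c: sum(c in answer for answer in notes.values())
                  for c in candidates}
        return max(candidates, key=lambda c: scores[c])


# Stub experts with canned answers, standing in for large models.
def instruction_expert(query: str) -> str:
    return "next landmark: door"


def environment_expert(query: str) -> str:
    return "door visible toward left"


agent = DiscussionAgent({
    "instruction understanding": instruction_expert,
    "environment understanding": environment_expert,
})
action = agent.step("Walk through the door.", "a hallway", ["left", "forward"])
print(action)
```

In this toy setup only the environment expert mentions a candidate direction, so the agent picks it. Replacing the stubs with real model calls and the counting rule with a reasoning step would be the substantive work, and the details of how DiscussNav does that are not given in this summary.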
Taxonomy
Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Human Pose and Action Recognition
