Active Visual Information Gathering for Vision-Language Navigation
Hanqing Wang, Wenguan Wang, Tianmin Shu, Wei Liang, Jianbing Shen

TL;DR
This paper introduces an active exploration framework for vision-language navigation, enabling agents to intelligently gather environmental information to improve navigation accuracy in photo-realistic settings.
Contribution
It presents an end-to-end learning approach for an exploration policy that determines when, where, and what to explore, enhancing navigation robustness.
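To make the "when" and "where" decisions concrete, here is a minimal sketch. It is not the paper's method: the paper learns its exploration policy end-to-end, while this sketch swaps in a hand-written entropy threshold purely for illustration. The function names, the threshold value, and the assumption that the navigator exposes a probability distribution over candidate directions are all hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a discrete probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def choose_exploration(nav_probs, threshold=1.0):
    """Return the index of a direction to explore, or None to skip exploring.

    Hypothetical heuristic (an assumption, not the learned policy):
    - "when": explore only if the navigation distribution is uncertain,
      i.e. its entropy is at or above `threshold`;
    - "where": look ahead in the currently top-ranked candidate direction.
    """
    if entropy(nav_probs) < threshold:
        return None  # confident enough: navigate without exploring
    return max(range(len(nav_probs)), key=lambda i: nav_probs[i])


confident = [0.97, 0.01, 0.01, 0.01]
uncertain = [0.30, 0.25, 0.25, 0.20]
print(choose_exploration(confident))
print(choose_exploration(uncertain))
```

A learned policy would replace both the fixed threshold and the top-ranked choice with trainable decisions, and would also decide what information to keep from the explored view. The sketch omits that third step.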
Findings
Significant improvement in navigation performance on the R2R benchmark.
Emergence of effective exploration strategies during training.
Enhanced results across all VLN settings, including single run, pre-exploration, and beam search.
Abstract
Vision-language navigation (VLN) is the task of requiring an agent to carry out navigational instructions inside photo-realistic environments. One of the key challenges in VLN is how to navigate robustly by mitigating the uncertainty caused by ambiguous instructions and insufficient observation of the environment. Agents trained by current approaches typically suffer from this uncertainty and consequently struggle to avoid random and inefficient actions at every step. In contrast, when humans face such a challenge, they can still navigate robustly by actively exploring the surroundings to gather more information and thus make more confident navigation decisions. This work draws inspiration from human navigation behavior and endows an agent with an active information gathering ability for a more intelligent vision-language navigation policy. To achieve this, we propose an…
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning
