Improving Vision-and-Language Navigation by Generating Future-View Image Semantics
Jialu Li, Mohit Bansal

TL;DR
This paper improves vision-and-language navigation by training agents to generate future-view semantics, yielding more accurate and more interpretable navigation.
Contribution
It introduces three proxy pre-training tasks for future view semantics generation, significantly improving VLN performance and interpretability.
Findings
Achieves state-of-the-art results on Room-to-Room and CVDN datasets.
Qualitatively, agents can fill in missing patches of future views.
Agents perform better on longer navigation paths.
Abstract
Vision-and-Language Navigation (VLN) is the task that requires an agent to navigate through the environment based on natural language instructions. At each step, the agent takes the next action by selecting from a set of navigable locations. In this paper, we aim to take one step further and explore whether the agent can benefit from generating the potential future view during navigation. Intuitively, humans have an expectation of what the future environment will look like, based on the natural language instructions and surrounding views, which aids correct navigation. Hence, to equip the agent with this ability to generate the semantics of future navigation views, we first propose three proxy tasks during the agent's in-domain pre-training: Masked Panorama Modeling (MPM), Masked Trajectory Modeling (MTM), and Action Prediction with Image Generation (APIG). These three…
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Human Pose and Action Recognition
