A New Path: Scaling Vision-and-Language Navigation with Synthetic Instructions and Imitation Learning
Aishwarya Kamath, Peter Anderson, Su Wang, Jing Yu Koh, Alexander Ku, Austin Waters, Yinfei Yang, Jason Baldridge, Zarana Parekh

TL;DR
This paper introduces a large-scale synthetic dataset of 4.2 million instruction-trajectory pairs for vision-and-language navigation and shows that training on it with imitation learning significantly improves agent performance.
Contribution
It presents a novel large-scale synthetic dataset and a simple transformer-based imitation learning approach that outperforms existing RL agents on the RxR dataset.
Findings
Outperforms all existing RL agents on the RxR dataset
Improves NDTW from 71.1 to 79.1 in seen environments
Improves generalization to unseen environments
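For readers unfamiliar with the NDTW metric cited above, the following is a minimal sketch of normalized dynamic time warping as it is commonly defined for navigation paths: exp(-DTW(R, Q) / (|R| * d_th)), where R is the reference path, Q is the agent's path, and d_th is a success-distance threshold. The function names, the use of straight-line distance between 2D points, and the 3.0 threshold are assumptions made for illustration; this is not the paper's evaluation code, and real evaluations typically measure distance along the navigation graph. Scores such as 71.1 and 79.1 are presumably this value expressed on a 0 to 100 scale.

```python
import math


def dtw(reference, query):
    """Dynamic time warping cost between two sequences of 2D points."""
    n, m = len(reference), len(query)
    inf = float("inf")
    # cost[i][j] = best alignment cost of reference[:i] with query[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Assumption: straight-line distance stands in for geodesic distance.
            d = math.dist(reference[i - 1], query[j - 1])
            cost[i][j] = d + min(
                cost[i - 1][j],      # advance along the reference only
                cost[i][j - 1],      # advance along the query only
                cost[i - 1][j - 1],  # advance along both
            )
    return cost[n][m]


def ndtw(reference, query, threshold=3.0):
    """Normalized DTW in (0, 1]; 1.0 means the paths align exactly.

    The threshold value is a hypothetical default chosen for illustration.
    """
    return math.exp(-dtw(reference, query) / (len(reference) * threshold))


if __name__ == "__main__":
    ref = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
    close = [(0.0, 0.2), (1.0, 0.2), (2.0, 0.2)]
    far = [(0.0, 4.0), (1.0, 4.0), (2.0, 4.0)]
    print("close path:", round(ndtw(ref, close), 3))
    print("far path:  ", round(ndtw(ref, far), 3))
```

A path that tracks the reference closely scores near 1, and the score decays exponentially as the agent's path drifts away, which is why the metric rewards following the instructed route rather than only reaching the goal.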
Abstract
Recent studies in Vision-and-Language Navigation (VLN) train RL agents to execute natural-language navigation instructions in photorealistic environments, as a step towards robots that can follow human instructions. However, given the scarcity of human instruction data and limited diversity in the training environments, these agents still struggle with complex language grounding and spatial language understanding. Pretraining on large text and image-text datasets from the web has been extensively explored but the improvements are limited. We investigate large-scale augmentation with synthetic instructions. We take 500+ indoor environments captured in densely-sampled 360 degree panoramas, construct navigation trajectories through these panoramas, and generate a visually-grounded instruction for each trajectory using Marky, a high-quality multilingual navigation instruction generator. We…
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques
Methods: Test
