Airbert: In-domain Pretraining for Vision-and-Language Navigation

Pierre-Louis Guhur; Makarand Tapaswi; Shizhe Chen; Ivan Laptev; Cordelia Schmid

arXiv:2108.09105·cs.CV·August 23, 2021

TL;DR

Airbert introduces a large-scale in-domain dataset and pretraining method for vision-and-language navigation, significantly improving generalization and performance on multiple benchmarks, especially in few-shot scenarios.

Contribution

The paper presents BnB, a new large-scale in-domain VLN dataset and a pretraining approach that enhances VLN agent generalization and performance.

Findings

01

Outperforms the state of the art on the R2R and REVERIE benchmarks.

02

Significantly improves few-shot VLN performance.

03

Introduces a shuffling loss for better temporal order learning.
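To make the idea of a shuffling loss concrete, here is a minimal sketch under stated assumptions; it is not the authors' implementation. It assumes the loss is a cross-entropy over one correctly ordered path and a few shuffled copies, so the model is rewarded for scoring the true order highest. The names `shuffled_copies`, `shuffling_loss`, `score_fn`, and `toy_order_score` are hypothetical and exist only for this illustration.

```python
import math
import random


def shuffled_copies(path, num_negatives, rng):
    """Return reorderings of `path` that differ from the original order."""
    if len(set(path)) < 2:
        raise ValueError("need at least two distinct steps to shuffle")
    negatives = []
    while len(negatives) < num_negatives:
        candidate = list(path)
        rng.shuffle(candidate)
        if candidate != list(path):  # skip the identity permutation
            negatives.append(candidate)
    return negatives


def shuffling_loss(score_fn, instruction, path, num_negatives=3, seed=0):
    """Cross-entropy in which candidate 0 is the correctly ordered path.

    Assumption: `score_fn(instruction, path)` returns a real-valued
    compatibility score; in a real model this would be a network output.
    """
    rng = random.Random(seed)
    candidates = [list(path)] + shuffled_copies(path, num_negatives, rng)
    scores = [score_fn(instruction, c) for c in candidates]
    # Numerically stable log-sum-exp over all candidate scores.
    top = max(scores)
    log_z = top + math.log(sum(math.exp(s - top) for s in scores))
    return log_z - scores[0]  # negative log-probability of the true order


def toy_order_score(instruction, path):
    """Hypothetical scorer: counts adjacent steps that appear in increasing order."""
    return sum(1.0 for a, b in zip(path, path[1:]) if a < b)


if __name__ == "__main__":
    steps = [0, 1, 2, 3]  # stand-ins for the images along a path
    print(shuffling_loss(toy_order_score, "walk to the kitchen", steps))
```

A scorer that ignores order gets a loss of log(num_negatives + 1), the value for a uniform guess, while a scorer that is sensitive to order gets a lower loss. That gap is the training signal this kind of objective is meant to provide.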

Abstract

Vision-and-language navigation (VLN) aims to enable embodied agents to navigate in realistic environments using natural language instructions. Given the scarcity of domain-specific training data and the high diversity of image and language inputs, the generalization of VLN agents to unseen environments remains challenging. Recent methods explore pretraining to improve generalization; however, the use of generic image-caption datasets or existing small-scale VLN environments is suboptimal and results in limited improvements. In this work, we introduce BnB, a large-scale and diverse in-domain VLN dataset. We first collect image-caption (IC) pairs from hundreds of thousands of listings from online rental marketplaces. Using the IC pairs, we next propose automatic strategies to generate millions of VLN path-instruction (PI) pairs. We further propose a shuffling loss that improves the learning of…
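The abstract does not spell out the strategies used to turn IC pairs into PI pairs. As a hedged illustration of one simple possibility, the sketch below samples a few image-caption pairs from a single listing, treats the image sequence as a path, and joins the captions in the same order to form an instruction. The `ImageCaption` type, the `make_path_instruction` function, and the example data are all hypothetical.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class ImageCaption:
    image_id: str  # hypothetical identifier for one listing photo
    caption: str   # the caption attached to that photo


def make_path_instruction(listing, num_steps, rng):
    """Build one (path, instruction) pair from a listing's image-caption pairs.

    Assumption: a "path" is an ordered list of image ids and the
    "instruction" is the matching captions concatenated in that order.
    """
    if num_steps > len(listing):
        raise ValueError("listing has fewer images than requested steps")
    steps = rng.sample(listing, num_steps)  # distinct images, random order
    path = [step.image_id for step in steps]
    # Normalize each caption so it ends with exactly one period.
    sentences = [step.caption.strip().rstrip(".") + "." for step in steps]
    return path, " ".join(sentences)


if __name__ == "__main__":
    listing = [
        ImageCaption("img_a", "A bright kitchen"),
        ImageCaption("img_b", "A sofa by the window."),
        ImageCaption("img_c", "Bedroom with a desk"),
    ]
    path, instruction = make_path_instruction(listing, 2, random.Random(0))
    print(path)
    print(instruction)
```

Because many different subsets and orderings can be drawn from each listing, a procedure of this kind can produce far more PI pairs than there are listings, in line with the scale the abstract describes.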

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications