A Recurrent Vision-and-Language BERT for Navigation

Yicong Hong; Qi Wu; Yuankai Qi; Cristian Rodriguez-Opazo; Stephen Gould

arXiv:2011.13922·cs.CV·March 30, 2021·40 cites

TL;DR

This paper introduces a recurrent, time-aware BERT model for vision-and-language navigation. The model maintains cross-modal state information across time steps to cope with partial observability, and it is evaluated on navigation and referring expression tasks.

Contribution

The paper proposes a novel recurrent BERT architecture tailored for VLN, enabling better history-dependent decision making and achieving state-of-the-art results.

Findings

01 Achieves state-of-the-art results on the R2R and REVERIE datasets.

02 Supports pre-training and generalises to other transformer-based architectures.

03 Can solve navigation and referring expression tasks simultaneously.

Abstract

Accuracy of many visiolinguistic tasks has benefited significantly from the application of vision-and-language (V&L) BERT. However, its application to the task of vision-and-language navigation (VLN) remains limited. One reason for this is the difficulty of adapting the BERT architecture to the partially observable Markov decision process present in VLN, which requires history-dependent attention and decision making. In this paper we propose a recurrent BERT model that is time-aware for use in VLN. Specifically, we equip the BERT model with a recurrent function that maintains cross-modal state information for the agent. Through extensive experiments on R2R and REVERIE we demonstrate that our model can replace more complex encoder-decoder models to achieve state-of-the-art results. Moreover, our approach can be generalised to other transformer-based architectures, supports pre-training, and is…
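
The idea of a recurrent function that carries cross-modal state from one time step to the next can be illustrated with a small toy sketch. This is not the authors' implementation: it replaces the full BERT layers with a single dot-product attention step, and the names `recurrent_step` and `navigate`, the residual state update, and the use of attention scores over visual candidates as action probabilities are all assumptions made for illustration.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def recurrent_step(state, lang_tokens, vis_tokens):
    """One time step of a toy recurrent cross-modal update (hypothetical).

    The state vector attends over language and visual tokens, the attended
    context is added back to the state, and a distribution over the visual
    candidates is read out as the action probabilities.
    """
    tokens = lang_tokens + vis_tokens
    scale = math.sqrt(len(state))
    weights = softmax([dot(state, t) / scale for t in tokens])
    context = [
        sum(w * t[i] for w, t in zip(weights, tokens))
        for i in range(len(state))
    ]
    # Assumed residual update: the new state keeps the history in `state`.
    new_state = [s + c for s, c in zip(state, context)]
    # Assumed action head: softmax of state-to-candidate scores.
    action_probs = softmax([dot(state, v) / scale for v in vis_tokens])
    return new_state, action_probs


def navigate(init_state, lang_tokens, observations):
    """Run the recurrent step over a sequence of per-step visual observations."""
    state = init_state
    actions = []
    for vis_tokens in observations:
        state, probs = recurrent_step(state, lang_tokens, vis_tokens)
        # Greedy choice of the most probable candidate direction.
        actions.append(max(range(len(probs)), key=probs.__getitem__))
    return state, actions


# Toy inputs: 2-dimensional embeddings, one instruction token, two steps.
instruction = [[0.5, 0.5]]
observations = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.0, 1.0], [2.0, 0.0]],
]
final_state, chosen = navigate([1.0, 0.0], instruction, observations)
print(chosen)
```

Because the state is passed from one call to the next, each decision depends on everything observed so far rather than on the current view alone, which is the property the abstract attributes to the recurrent function.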

Code & Models

Repositories

YicongHong/Recurrent-VLN-BERT (PyTorch, official)

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques

Methods: Linear Layer · Layer Normalization · Dense Connections · Residual Connection · Attention Dropout · Weight Decay · Attention Is All You Need · Multi-Head Attention · Linear Warmup With Linear Decay