History Aware Multimodal Transformer for Vision-and-Language Navigation

Shizhe Chen; Pierre-Louis Guhur; Cordelia Schmid; Ivan Laptev

arXiv:2110.13309 · cs.CV · August 21, 2023 · 21 cites

TL;DR

This paper introduces HAMT, a transformer that encodes the full history of past panoramic observations with a hierarchical vision transformer and combines that history with the instruction text and the current observation to predict the next action in vision-and-language navigation.

Contribution

The paper presents HAMT, a transformer-based architecture that encodes a long-horizon history for VLN in place of the recurrent-state memory used by most previous approaches.

Findings

01. HAMT achieves state-of-the-art results on multiple VLN benchmarks.

02. HAMT is especially effective for long-horizon navigation tasks.

03. The hierarchical encoding improves understanding of spatial and temporal relations.

Abstract

Vision-and-language navigation (VLN) aims to build autonomous visual agents that follow instructions and navigate in real scenes. To remember previously visited locations and actions taken, most approaches to VLN implement memory using recurrent states. Instead, we introduce a History Aware Multimodal Transformer (HAMT) to incorporate a long-horizon history into multimodal decision making. HAMT efficiently encodes all the past panoramic observations via a hierarchical vision transformer (ViT), which first encodes individual images with ViT, then models the spatial relations between images in a panoramic observation, and finally takes into account the temporal relations between panoramas in the history. It then jointly combines text, history and current observation to predict the next action. We first train HAMT end-to-end using several proxy tasks including single step action prediction and…
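The three-stage history encoding described in the abstract can be illustrated with a small sketch. This is not the authors' implementation (see the cshizhe/vln-hamt repository for that); it is a toy version under stated assumptions: per-view image features are random stand-ins for ViT outputs, attention is single-head with identity projections and no learned weights, each panorama is summarized by mean pooling, and time steps are marked with a sinusoidal encoding. The names `encode_history`, `step_encoding`, and the sizes `T`, `K`, `D` are hypothetical.

```python
import math
import random

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(tokens):
    # Single-head scaled dot-product self-attention.
    # Assumption: identity query/key/value projections (no learned weights).
    d = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        weights = softmax(scores)
        out.append([sum(w * v[i] for w, v in zip(weights, tokens)) for i in range(d)])
    return out

def mean_pool(tokens):
    # Average a list of vectors into one vector (assumed pooling choice).
    d = len(tokens[0])
    return [sum(t[i] for t in tokens) / len(tokens) for i in range(d)]

def step_encoding(t, d):
    # Sinusoidal encoding of the time step t (assumed, for illustration only).
    enc = []
    for i in range(d):
        if i % 2 == 0:
            enc.append(math.sin(t / 10000 ** (i / d)))
        else:
            enc.append(math.cos(t / 10000 ** ((i - 1) / d)))
    return enc

def encode_history(history):
    # history[t][k] is the feature vector of view k in the panorama at step t.
    # Stage 1 (per-image ViT encoding) is assumed to have produced these vectors.
    d = len(history[0][0])
    # Stage 2: model spatial relations between views inside each panorama,
    # then summarize each panorama as a single token.
    pano_tokens = [mean_pool(self_attention(views)) for views in history]
    # Stage 3: model temporal relations between panoramas across the history.
    pano_tokens = [
        [x + p for x, p in zip(tok, step_encoding(t, d))]
        for t, tok in enumerate(pano_tokens)
    ]
    return self_attention(pano_tokens)

# Toy usage: T past steps, K views per panorama, D feature dimensions
# (all three sizes are illustrative assumptions).
random.seed(0)
T, K, D = 4, 36, 8
history = [[[random.gauss(0, 1) for _ in range(D)] for _ in range(K)] for _ in range(T)]
history_tokens = encode_history(history)
print(len(history_tokens), len(history_tokens[0]))  # prints "4 8"
```

The result is one token per past step. In the model described above, such history tokens would then be combined with the instruction text and the current observation to predict the next action; that cross-modal step is not shown here.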

Code & Models

Repositories

cshizhe/vln-hamt
pytorch

Videos

History Aware Multimodal Transformer for Vision-and-Language Navigation · slideslive

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Generative Adversarial Networks and Image Synthesis

Methods: Attention Is All You Need · Linear Layer · Dense Connections · Byte Pair Encoding · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Label Smoothing · Adam · Multi-Head Attention · Dropout