History Aware Multimodal Transformer for Vision-and-Language Navigation
Shizhe Chen, Pierre-Louis Guhur, Cordelia Schmid, Ivan Laptev

TL;DR
This paper introduces HAMT, a transformer model that hierarchically encodes the full history of past panoramic observations and combines it with the instruction text and the current observation to predict the next action in vision-and-language navigation.
Contribution
The paper presents HAMT, a transformer-based architecture that encodes long-horizon history for VLN, replacing the recurrent-state memory used by most previous approaches.
Findings
HAMT achieves state-of-the-art results on multiple VLN benchmarks.
HAMT is especially effective for long-horizon navigation tasks.
The hierarchical encoding models spatial relations between images within a panorama and temporal relations between panoramas in the history.
Abstract
Vision-and-language navigation (VLN) aims to build autonomous visual agents that follow instructions and navigate in real scenes. To remember previously visited locations and actions taken, most approaches to VLN implement memory using recurrent states. Instead, we introduce a History Aware Multimodal Transformer (HAMT) to incorporate a long-horizon history into multimodal decision making. HAMT efficiently encodes all the past panoramic observations via a hierarchical vision transformer (ViT), which first encodes individual images with ViT, then models spatial relation between images in a panoramic observation and finally takes into account temporal relation between panoramas in the history. It, then, jointly combines text, history and current observation to predict the next action. We first train HAMT end-to-end using several proxy tasks including single step action prediction and…
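The three-level history encoding described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation, and it makes several assumptions: the per-image ViT is replaced by precomputed feature vectors, attention is single-head and has no learned weights, each panorama is reduced to one token by mean pooling, and the step order is injected with a sinusoidal encoding. All function names (self_attention, mean_pool, add_step_encoding, encode_history) are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Parameter-free single-head self-attention (assumption: no learned
    query/key/value projections, unlike a real transformer layer)."""
    dim = len(tokens[0])
    scale = math.sqrt(dim)
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in tokens]
        weights = softmax(scores)
        out.append([sum(w * v[i] for w, v in zip(weights, tokens))
                    for i in range(dim)])
    return out


def mean_pool(tokens):
    """Average a list of vectors into one vector."""
    n = len(tokens)
    return [sum(t[i] for t in tokens) / n for i in range(len(tokens[0]))]


def add_step_encoding(tokens):
    """Add a sinusoidal encoding of the time step (assumed stand-in for
    whatever temporal embedding the real model uses)."""
    dim = len(tokens[0])
    out = []
    for t, tok in enumerate(tokens):
        enc = [math.sin(t / 10000 ** (i / dim)) if i % 2 == 0
               else math.cos(t / 10000 ** ((i - 1) / dim))
               for i in range(dim)]
        out.append([x + e for x, e in zip(tok, enc)])
    return out


def encode_history(history):
    """history: list of panoramas, one per past step; each panorama is a
    list of per-view feature vectors (level 1, stand-in for ViT outputs)."""
    # Level 2: relate the views inside each panorama, then pool to one token.
    step_tokens = [mean_pool(self_attention(views)) for views in history]
    # Level 3: relate the per-step tokens across time.
    return self_attention(add_step_encoding(step_tokens))


# Toy history: 2 past steps, 3 views per panorama, 4-dim view features.
history = [
    [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.5, 0.5, 0.0, 0.0]],
    [[0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0], [0.0, 0.0, 0.5, 0.5]],
]
encoded = encode_history(history)
print(len(encoded), len(encoded[0]))  # one 4-dim history token per past step
```

In the model described above, history tokens of this kind would then be combined with the text and current-observation tokens to predict the next action; that cross-modal step is left out of this sketch.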
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Generative Adversarial Networks and Image Synthesis
Methods: Attention Is All You Need · Linear Layer · Dense Connections · Byte Pair Encoding · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Label Smoothing · Adam · Multi-Head Attention · Dropout
