Multimodal Transformer with Variable-length Memory for Vision-and-Language Navigation

Chuang Lin; Yi Jiang; Jianfei Cai; Lizhen Qu; Gholamreza Haffari; Zehuan Yuan

arXiv:2111.05759 · cs.CV · July 19, 2022 · 1 citation

Open Access · 1 repo

TL;DR

This paper introduces MTVM, a multimodal transformer with variable-length memory for vision-and-language navigation, explicitly modeling long-term temporal context to improve navigation performance.

Contribution

It proposes a novel MTVM model that stores navigation history in a memory bank and uses a memory-aware loss, enhancing temporal context modeling in VLN tasks.

Findings

01

Improves Success Rate by 2% on R2R unseen validation and test sets.

02

Reports a 1.6 m (meters) difference in Goal Progress on the CVDN test set.

03

Outperforms previous methods in long-term temporal context understanding.

Abstract

Vision-and-Language Navigation (VLN) is a task in which an agent must follow a language instruction to navigate to a goal position, relying on ongoing interaction with the environment as it moves. Recent Transformer-based VLN methods have made great progress by directly connecting visual observations and the language instruction through multimodal cross-attention. However, these methods usually represent temporal context as a fixed-length vector, either by using an LSTM decoder or by using manually designed hidden states to build a recurrent Transformer. Because a single fixed-length vector is often insufficient to capture long-term temporal context, this paper introduces Multimodal Transformer with Variable-length Memory (MTVM) for visually-grounded natural language navigation, which models the temporal context explicitly.…
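To make the contrast between a fixed-length state and a variable-length memory concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the class `MemoryBank`, its `write` and `read` methods, and the toy two-dimensional step vectors are all hypothetical, and single-query dot-product attention stands in for the model's multimodal Transformer layers. The only idea it illustrates is that the memory grows by one slot per navigation step and is read by attending over every stored slot.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


class MemoryBank:
    """Hypothetical variable-length memory: one stored vector per past step."""

    def __init__(self):
        self.slots = []  # grows with the trajectory; no fixed size

    def write(self, vector):
        """Append the current step's vector as a new memory slot."""
        self.slots.append(list(vector))

    def read(self, query):
        """Scaled dot-product attention of `query` over all stored slots."""
        if not self.slots:
            return [0.0] * len(query)  # nothing remembered yet
        scale = math.sqrt(len(query))
        scores = [
            sum(q * k for q, k in zip(query, slot)) / scale
            for slot in self.slots
        ]
        weights = softmax(scores)
        return [
            sum(w * slot[i] for w, slot in zip(weights, self.slots))
            for i in range(len(query))
        ]


# Toy rollout: at each step, read the history, then store the current step.
bank = MemoryBank()
for step_vec in ([1.0, 0.0], [0.0, 1.0], [1.0, 1.0]):
    context = bank.read(step_vec)  # summary of all earlier steps
    bank.write(step_vec)
```

After the loop, `bank.slots` holds three entries, one per step. A recurrent design would instead overwrite a single vector at each step, so earlier steps could only be recovered through whatever that one vector retained.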

Code & Models

Repositories

clin1223/mtvm
PyTorch · Official

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Topic Modeling

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Absolute Position Encodings · Residual Connection · Label Smoothing · Position-Wise Feed-Forward Layer · Sigmoid Activation · Adam · Layer Normalization