Multimodal Transformer with Variable-length Memory for Vision-and-Language Navigation
Chuang Lin, Yi Jiang, Jianfei Cai, Lizhen Qu, Gholamreza Haffari, Zehuan Yuan

TL;DR
This paper introduces MTVM, a multimodal transformer with variable-length memory for vision-and-language navigation, explicitly modeling long-term temporal context to improve navigation performance.
Contribution
It proposes a novel MTVM model that stores navigation history in a memory bank and uses a memory-aware loss, enhancing temporal context modeling in VLN tasks.
Findings
Improves Success Rate by 2% on R2R unseen validation and test sets.
Improves Goal Progress by 1.6 m on the CVDN test set.
Outperforms previous methods in long-term temporal context understanding.
Abstract
Vision-and-Language Navigation (VLN) is a task in which an agent must follow a language instruction to navigate to a goal position, relying on ongoing interaction with the environment as it moves. Recent Transformer-based VLN methods have made great progress by connecting visual observations directly to the language instruction through multimodal cross-attention. However, these methods usually represent temporal context as a fixed-length vector, either by using an LSTM decoder or by using manually designed hidden states to build a recurrent Transformer. Because a single fixed-length vector is often insufficient to capture long-term temporal context, we introduce Multimodal Transformer with Variable-length Memory (MTVM) for visually-grounded natural language navigation, which models the temporal context explicitly.…
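The contrast between a fixed-length state and a variable-length memory can be illustrated with a small sketch. This is not the authors' implementation: the names MemoryBank, attend, and run_episode are hypothetical, the activations are toy vectors, and single-query dot-product attention stands in for the full multimodal Transformer. It only shows the idea that the memory grows by one entry per navigation step and that the agent attends over all stored entries rather than compressing them into one vector.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, memory):
    """Scaled dot-product attention of one query vector over all memory entries."""
    d = len(query)
    scores = [
        sum(q * k for q, k in zip(query, entry)) / math.sqrt(d)
        for entry in memory
    ]
    weights = softmax(scores)
    context = [
        sum(w * entry[i] for w, entry in zip(weights, memory))
        for i in range(d)
    ]
    return context, weights


class MemoryBank:
    """Hypothetical variable-length memory: one stored vector per step."""

    def __init__(self):
        self.entries = []

    def write(self, activation):
        self.entries.append(list(activation))

    def __len__(self):
        return len(self.entries)


def run_episode(step_activations):
    """Attend over the full history at each step, then store the new activation."""
    bank = MemoryBank()
    contexts = []
    for act in step_activations:
        if len(bank) > 0:
            context, _ = attend(act, bank.entries)
        else:
            context = list(act)  # no history yet at the first step
        contexts.append(context)
        bank.write(act)
    return bank, contexts


bank, contexts = run_episode([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
print(len(bank), contexts)
```

Unlike an LSTM hidden state, the memory here has no fixed size, so the cost of attention grows with the number of steps taken in the episode.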
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Topic Modeling
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Absolute Position Encodings · Residual Connection · Label Smoothing · Position-Wise Feed-Forward Layer · Sigmoid Activation · Adam · Layer Normalization
