TL;DR
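This paper introduces the Episodic Transformer (E.T.), a multimodal transformer that encodes language instructions together with the full episode history of visual observations and actions. It targets language-guided navigation and interaction in dynamic environments and sets a new state of the art on the ALFRED benchmark.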
Contribution
The paper presents a novel Episodic Transformer architecture that effectively encodes episode history and uses synthetic instructions to enhance navigation performance.
Findings
Encoding episode history with a transformer is crucial for compositional tasks.
Pretraining and synthetic instructions significantly boost navigation success.
Sets a new state of the art on the ALFRED benchmark, with task success rates of 38.4% and 8.5% on the seen and unseen test splits, respectively.
Abstract
Interaction and navigation defined by natural language instructions in dynamic environments pose significant challenges for neural agents. This paper focuses on addressing two challenges: handling long sequences of subtasks, and understanding complex human instructions. We propose Episodic Transformer (E.T.), a multimodal transformer that encodes language inputs and the full episode history of visual observations and actions. To improve training, we leverage synthetic instructions as an intermediate representation that decouples understanding the visual appearance of an environment from the variations of natural language instructions. We demonstrate that encoding the history with a transformer is critical for solving compositional tasks, and that pretraining and joint training with synthetic instructions further improve the performance. Our approach sets a new state of the art on the…
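To make the idea of encoding the "full episode history" concrete, here is a minimal sketch of how language tokens, every observed frame, and every past action could be laid out as one input sequence for a transformer encoder. This is an illustration only, not the authors' implementation: the `Token` class, the `build_episode_sequence` function, the modality labels, and the string placeholders for frames and actions are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Token:
    """One element of the joint input sequence (hypothetical structure)."""
    modality: str  # "lang", "frame", or "action"
    value: str     # placeholder payload: a word, a frame id, or an action name
    position: int  # index within its own modality (time step for frames/actions)


def build_episode_sequence(
    instruction: List[str], frames: List[str], actions: List[str]
) -> List[Token]:
    """Concatenate the instruction, all frames so far, and all past actions.

    Assumption for this sketch: the agent has seen at least as many frames
    as it has taken actions, so len(actions) <= len(frames).
    """
    if len(actions) > len(frames):
        raise ValueError("cannot have more past actions than observed frames")
    sequence = [Token("lang", word, i) for i, word in enumerate(instruction)]
    sequence += [Token("frame", frame, t) for t, frame in enumerate(frames)]
    sequence += [Token("action", action, t) for t, action in enumerate(actions)]
    return sequence


# Example episode: three instruction words, two frames seen, one action taken.
episode = build_episode_sequence(
    ["pick", "up", "mug"], ["frame_0", "frame_1"], ["MoveAhead"]
)
for token in episode:
    print(token.modality, token.position, token.value)
```

The point of the layout is that nothing from the episode is summarized away: every earlier frame and action stays available to attention at each step, which is the property the abstract credits for solving compositional tasks. A real model would replace the string placeholders with learned embeddings.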
Taxonomy
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Dense Connections · Adam · Layer Normalization · Softmax · Label Smoothing
