Multimodal Attention Networks for Low-Level Vision-and-Language Navigation
Federico Landi, Lorenzo Baraldi, Marcella Cornia, Massimiliano Corsini, Rita Cucchiara

TL;DR
This paper introduces PTA, a fully-attentive Transformer-based architecture for low-level vision-and-language navigation that effectively integrates multiple modalities and long-term dependencies, improving navigation performance.
Contribution
The paper presents PTA, the first Transformer-like model for low-level VLN that combines natural language, images, and actions with early and late fusion strategies.
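To make the distinction between early and late fusion concrete, here is a minimal sketch in plain Python. It is not the PTA implementation: the function names (attend, early_fusion, late_fusion), the toy vectors, and the choice of which modalities are merged at which stage are all assumptions made for illustration. Early fusion is shown as one modality attending to another before further processing; late fusion is shown as combining separately pooled summaries at the end.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attend(queries, keys, values):
    # Scaled dot-product attention over lists of equal-width vectors.
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        width = len(values[0])
        out.append([sum(w * v[i] for w, v in zip(weights, values)) for i in range(width)])
    return out

def mean_pool(tokens):
    # Average a sequence of vectors into a single vector.
    n = len(tokens)
    return [sum(t[i] for t in tokens) / n for i in range(len(tokens[0]))]

def early_fusion(text_tokens, image_tokens):
    # Hypothetical early fusion: image tokens query the text tokens,
    # so the two modalities are mixed before any later stage.
    return attend(image_tokens, text_tokens, text_tokens)

def late_fusion(fused_tokens, action_tokens):
    # Hypothetical late fusion: each stream is summarized on its own,
    # and the summaries are concatenated only at the end.
    return mean_pool(fused_tokens) + mean_pool(action_tokens)

# Toy inputs (assumed values, two-dimensional embeddings).
text = [[1.0, 0.0], [0.0, 1.0]]
image = [[1.0, 0.0]]
actions = [[0.5, 0.5], [1.5, 0.5]]

fused = early_fusion(text, image)
state = late_fusion(fused, actions)
print(len(fused), len(state))
```

In this sketch the image token is closest to the first text token, so that token receives the larger attention weight, while the action history contributes only through its pooled summary in the final concatenated state.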
Findings
PTA achieves promising results on the R2R dataset.
PTA performs well on the R4R benchmark.
The model effectively handles multi-modality and long-term dependencies.
Abstract
Vision-and-Language Navigation (VLN) is a challenging task in which an agent needs to follow a language-specified path to reach a target destination. The goal gets even harder as the actions available to the agent get simpler and move towards low-level, atomic interactions with the environment. This setting takes the name of low-level VLN. In this paper, we strive for the creation of an agent able to tackle three key issues: multi-modality, long-term dependencies, and adaptability towards different locomotive settings. To that end, we devise "Perceive, Transform, and Act" (PTA): a fully-attentive VLN architecture that leaves the recurrent approach behind and the first Transformer-like architecture incorporating three different modalities - natural language, images, and low-level actions for the agent control. In particular, we adopt an early fusion strategy to merge lingual and visual…
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques
