Multimodal Attention Networks for Low-Level Vision-and-Language Navigation

Federico Landi; Lorenzo Baraldi; Marcella Cornia; Massimiliano Corsini; Rita Cucchiara

arXiv:1911.12377·cs.CV·August 2, 2021

TL;DR

This paper introduces PTA, a fully-attentive, Transformer-based architecture for low-level vision-and-language navigation (VLN). It integrates multiple modalities and models long-term dependencies to improve navigation performance.

Contribution

The paper presents PTA, the first Transformer-like model for low-level VLN that combines natural language, images, and actions with early and late fusion strategies.

Findings

01. PTA achieves promising results on the R2R dataset.

02. PTA performs well on the R4R benchmark.

03. The model effectively handles multi-modality and long-term dependencies.

Abstract

Vision-and-Language Navigation (VLN) is a challenging task in which an agent needs to follow a language-specified path to reach a target destination. The goal gets even harder as the actions available to the agent get simpler and move towards low-level, atomic interactions with the environment. This setting takes the name of low-level VLN. In this paper, we strive for the creation of an agent able to tackle three key issues: multi-modality, long-term dependencies, and adaptability towards different locomotive settings. To that end, we devise "Perceive, Transform, and Act" (PTA): a fully-attentive VLN architecture that leaves the recurrent approach behind and the first Transformer-like architecture incorporating three different modalities - natural language, images, and low-level actions for the agent control. In particular, we adopt an early fusion strategy to merge lingual and visual…
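
The abstract mentions an early fusion strategy for merging language and vision, but the text above is cut off before it gives details. As a rough illustration only, the sketch below shows one generic way attention can fuse two modalities: each word embedding attends over a set of image-region features using scaled dot-product attention. This is not the PTA implementation. The function names `softmax` and `cross_attention`, the toy inputs, and the choice of words as queries and regions as keys and values are all assumptions made for this example.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product attention: every query attends over all keys.

    Returns one fused vector per query, each a weighted average of `values`.
    """
    d = len(keys[0])
    fused = []
    for q in queries:
        # Similarity between this query and every key, scaled by sqrt(d).
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys
        ]
        weights = softmax(scores)
        # Weighted average of the value vectors, one output dimension at a time.
        fused.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return fused


# Hypothetical toy inputs (assumed for illustration):
# 2 word embeddings and 3 image-region features, all of dimension 2.
words = [[1.0, 0.0], [0.0, 1.0]]
regions = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Language queries attend over visual keys/values.
fused = cross_attention(words, regions, regions)
```

Here `fused` holds one vision-conditioned vector per word. A full Transformer layer would add learned projections, multiple heads, residual connections, and normalization, all of which are omitted from this sketch.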

Code & Models

Repositories

aimagelab/perceive-transform-and-act
PyTorch · Official

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques