CrossMap Transformer: A Crossmodal Masked Path Transformer Using Double Back-Translation for Vision-and-Language Navigation

Aly Magassouba, Komei Sugiura, and Hisashi Kawai

arXiv:2103.00852 · cs.RO · August 22, 2023 · 6 cites

Open Access

TL;DR

The paper introduces CrossMap Transformer, a novel model for vision-and-language navigation that uses double back-translation between instructions and paths to improve understanding and generation of navigation commands.

Contribution

It proposes a crossmodal masked path transformer with a double back-translation mechanism for enhanced navigation instruction understanding and generation.

Findings

01. Improved accuracy in instruction understanding.

02. Enhanced instruction generation quality.

03. Effective mutual enhancement between visual and linguistic features.

Abstract

Navigation guided by natural language instructions is particularly suitable for Domestic Service Robots that interact naturally with users. This task involves predicting a sequence of actions that leads to a specified destination given a natural language navigation instruction. The task thus requires understanding instructions such as "Walk out of the bathroom and wait on the stairs that are on the right". Vision-and-Language Navigation remains challenging, notably because it requires exploring the environment and accurately following the path specified by the instructions in order to model the relationship between language and vision. To address this, we propose the CrossMap Transformer network, which encodes the linguistic and visual features to sequentially generate a path. The CrossMap Transformer is tied to a Transformer-based speaker that generates…

Taxonomy

Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Advanced Image and Video Retrieval Techniques

Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Byte Pair Encoding · Layer Normalization · Residual Connection · Adam · Dropout · Label Smoothing