CrossMap Transformer: A Crossmodal Masked Path Transformer Using Double Back-Translation for Vision-and-Language Navigation
Aly Magassouba, Komei Sugiura, and Hisashi Kawai

TL;DR
The paper introduces CrossMap Transformer, a novel model for vision-and-language navigation that uses double back-translation between instructions and paths to improve understanding and generation of navigation commands.
Contribution
It proposes a crossmodal masked path transformer with a double back-translation mechanism for enhanced navigation instruction understanding and generation.
Findings
Improved accuracy in instruction understanding.
Enhanced instruction generation quality.
Effective mutual enhancement between visual and linguistic features.
Abstract
Navigation guided by natural language instructions is particularly suitable for Domestic Service Robots that interact naturally with users. This task involves predicting a sequence of actions that leads to a specified destination given a natural language navigation instruction. The task thus requires understanding instructions such as "Walk out of the bathroom and wait on the stairs that are on the right". Vision-and-Language Navigation remains challenging, notably because it requires exploring the environment and accurately following the path specified by the instructions in order to model the relationship between language and vision. To address this, we propose the CrossMap Transformer network, which encodes the linguistic and visual features to sequentially generate a path. The CrossMap Transformer is tied to a Transformer-based speaker that generates…
Taxonomy
Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Advanced Image and Video Retrieval Techniques
Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Byte Pair Encoding · Layer Normalization · Residual Connection · Adam · Dropout · Label Smoothing
