Cross-modal Map Learning for Vision and Language Navigation

Georgios Georgakis; Karl Schmeckpeper; Karan Wanchoo; Soham Dan; Eleni Miltsakaki; Dan Roth; Kostas Daniilidis

arXiv:2203.05137 · cs.CV · March 22, 2022

TL;DR

This paper introduces a cross-modal map learning approach for vision-and-language navigation. It grounds language in explicit spatial (map) representations rather than in unstructured memory or egocentric observations, and reports competitive results on the VLN-CE benchmark.

Contribution

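The paper proposes a cross-modal map learning model that first predicts top-down semantics on an egocentric map, covering both observed and unobserved regions, and then predicts a path towards the goal as a set of waypoints. Both predictions are informed by the language instruction through cross-modal attention.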

Findings

01. Effective map-based navigation with language guidance

02. Competitive performance on VLN-CE benchmark

03. Explicit spatial representations enhance language-vision association

Abstract

We consider the problem of Vision-and-Language Navigation (VLN). The majority of current methods for VLN are trained end-to-end using either unstructured memory such as LSTM, or using cross-modal attention over the egocentric observations of the agent. In contrast to other works, our key insight is that the association between language and vision is stronger when it occurs in explicit spatial representations. In this work, we propose a cross-modal map learning model for vision-and-language navigation that first learns to predict the top-down semantics on an egocentric map for both observed and unobserved regions, and then predicts a path towards the goal as a set of waypoints. In both cases, the prediction is informed by the language through cross-modal attention mechanisms. We experimentally test the basic hypothesis that language-driven navigation can be solved given a map, and then…
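To make the idea of language-informed attention over a map concrete, the following is a minimal sketch of scaled dot-product cross-attention in plain Python. It is not the authors' architecture: the choice of map cells as queries and instruction tokens as keys and values, the absence of learned projection layers, and all names and toy values (`cross_modal_attention`, `map_cells`, `word_feats`) are assumptions made purely for illustration. The official repository is PyTorch-based; plain Python is used here only to keep the example dependency-free.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_modal_attention(map_cells, word_feats):
    """Let each map cell (query) attend over instruction tokens (keys/values).

    map_cells:  list of feature vectors, one per flattened map cell (hypothetical)
    word_feats: list of feature vectors, one per instruction token (hypothetical)
    Returns the language context per cell and the attention weights per cell.
    """
    d = len(word_feats[0])
    scale = math.sqrt(d)
    attended, weights = [], []
    for q in map_cells:
        # Similarity between this map cell and every instruction token.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in word_feats]
        w = softmax(scores)
        # Weighted sum of token features: the language context for this cell.
        ctx = [sum(w[j] * word_feats[j][i] for j in range(len(word_feats)))
               for i in range(d)]
        weights.append(w)
        attended.append(ctx)
    return attended, weights


# Toy input: a 2x2 egocentric map flattened to 4 cells, 3 tokens, feature dim 2.
map_cells = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
word_feats = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
attended, weights = cross_modal_attention(map_cells, word_feats)
```

In this toy setup, a cell whose features align with a token's features places most of its attention on that token, while an all-zero cell attends uniformly. A real model would apply learned projections to both modalities and feed the attended features into the map and waypoint prediction heads.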

Code & Models

Repositories

ggeorgak11/cm2
PyTorch · Official

Taxonomy

Methods: Tanh Activation · Sigmoid Activation · Long Short-Term Memory