GridMM: Grid Memory Map for Vision-and-Language Navigation

Zihan Wang; Xiangyang Li; Jiahao Yang; Yeqi Liu; Shuqiang Jiang

arXiv:2307.12907·cs.CV·August 25, 2023

TL;DR

This paper introduces GridMM, a top-down, egocentric grid memory map for VLN that captures spatial relations and fine-grained visual clues, improving navigation performance in both discrete and continuous 3D environments.

Contribution

We propose a dynamic, egocentric grid memory map for VLN that better models spatial relations and visual details compared to existing memory methods.

Findings

01 Outperforms existing methods on the REVERIE, R2R, and SOON datasets

02 Effective in both discrete and continuous environments

03 Demonstrates superior navigation accuracy

Abstract

Vision-and-language navigation (VLN) enables the agent to navigate to a remote location following the natural language instruction in 3D environments. To represent the previously visited environment, most approaches for VLN implement memory using recurrent states, topological maps, or top-down semantic maps. In contrast to these approaches, we build the top-down egocentric and dynamically growing Grid Memory Map (i.e., GridMM) to structure the visited environment. From a global perspective, historical observations are projected into a unified grid map in a top-down view, which can better represent the spatial relations of the environment. From a local perspective, we further propose an instruction relevance aggregation method to capture fine-grained visual clues in each grid region. Extensive experiments are conducted on both the REVERIE, R2R, and SOON datasets in the discrete environments,…
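
The two ideas in the abstract, projecting past observations into an agent-centred top-down grid and then pooling the features in each cell by their relevance to the instruction, can be illustrated with a small sketch. This is not the authors' implementation: the function names, the fixed grid size, the plain dot-product relevance score, and the softmax pooling are all assumptions chosen for illustration, whereas the paper describes a dynamically growing map and its own aggregation method.

```python
import math
from collections import defaultdict


def to_egocentric_cell(point_xy, agent_xy, agent_heading, cell_size, grid_size):
    """Map a world-frame (x, y) point to a (row, col) cell of an agent-centred grid.

    Hypothetical helper: returns None when the point lies outside the
    grid_size x grid_size map.
    """
    dx = point_xy[0] - agent_xy[0]
    dy = point_xy[1] - agent_xy[1]
    # Rotate by -heading so the agent's forward direction becomes the grid's +x axis.
    cos_h, sin_h = math.cos(-agent_heading), math.sin(-agent_heading)
    ex = dx * cos_h - dy * sin_h
    ey = dx * sin_h + dy * cos_h
    half = grid_size * cell_size / 2.0
    col = math.floor((ex + half) / cell_size)
    row = math.floor((ey + half) / cell_size)
    if 0 <= row < grid_size and 0 <= col < grid_size:
        return row, col
    return None


def aggregate_by_relevance(features, instruction_vec):
    """Pool the feature vectors of one cell, weighted by similarity to the instruction.

    Assumption: relevance is a dot product followed by a softmax.
    """
    scores = [sum(f * q for f, q in zip(feat, instruction_vec)) for feat in features]
    max_score = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(instruction_vec)
    return [sum(w * feat[d] for w, feat in zip(weights, features)) for d in range(dim)]


def build_grid_map(observations, agent_xy, agent_heading, instruction_vec,
                   cell_size=1.0, grid_size=4):
    """Project (point_xy, feature) pairs into the grid and pool each occupied cell."""
    cells = defaultdict(list)
    for point_xy, feature in observations:
        cell = to_egocentric_cell(point_xy, agent_xy, agent_heading, cell_size, grid_size)
        if cell is not None:
            cells[cell].append(feature)
    return {cell: aggregate_by_relevance(feats, instruction_vec)
            for cell, feats in cells.items()}


if __name__ == "__main__":
    # Toy data: two nearby points share a cell, one point is out of range.
    observations = [((0.5, 0.5), [1.0, 0.0]),
                    ((0.6, 0.4), [0.0, 1.0]),
                    ((10.0, 10.0), [1.0, 1.0])]
    print(build_grid_map(observations, (0.0, 0.0), 0.0, [1.0, 0.0]))
```

Because the map is rebuilt relative to the agent's current pose, calling `build_grid_map` again after the agent moves or turns re-projects the same historical observations into new cells, which is the sense in which the map is egocentric.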

Code & Models

Repositories

mrzihan/gridmm
PyTorch · Official

Videos

GridMM: Grid Memory Map for Vision-and-Language Navigation · YouTube

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Robotics and Sensor-Based Localization