GridMM: Grid Memory Map for Vision-and-Language Navigation
Zihan Wang, Xiangyang Li, Jiahao Yang, Yeqi Liu, Shuqiang Jiang

TL;DR
This paper introduces GridMM, a novel top-down grid memory map for VLN that effectively captures spatial relations and visual clues, improving navigation performance in various 3D environments.
Contribution
We propose a dynamic, egocentric grid memory map for VLN that better models spatial relations and visual details compared to existing memory methods.
Findings
Outperforms existing methods on the REVERIE, R2R, and SOON datasets
Effective in both discrete and continuous environments
Demonstrates superior navigation accuracy
Abstract
Vision-and-language navigation (VLN) enables the agent to navigate to a remote location following the natural language instruction in 3D environments. To represent the previously visited environment, most approaches for VLN implement memory using recurrent states, topological maps, or top-down semantic maps. In contrast to these approaches, we build the top-down egocentric and dynamically growing Grid Memory Map (i.e., GridMM) to structure the visited environment. From a global perspective, historical observations are projected into a unified grid map in a top-down view, which can better represent the spatial relations of the environment. From a local perspective, we further propose an instruction relevance aggregation method to capture fine-grained visual clues in each grid region. Extensive experiments are conducted on both the REVERIE, R2R, SOON datasets in the discrete environments,…
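The two ideas in the abstract, projecting past observations into an egocentric top-down grid and aggregating the features in each cell by their relevance to the instruction, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the 2D positions, the tiny feature vectors, the cell size, and the dot-product-plus-softmax relevance score are all assumptions chosen to keep the example small.

```python
import math
from collections import defaultdict


def to_egocentric(point, agent_xy, agent_heading):
    """Express a world-frame (x, y) point in the agent's frame (hypothetical helper)."""
    dx = point[0] - agent_xy[0]
    dy = point[1] - agent_xy[1]
    cos_h, sin_h = math.cos(-agent_heading), math.sin(-agent_heading)
    return (dx * cos_h - dy * sin_h, dx * sin_h + dy * cos_h)


def build_grid_map(observations, agent_xy, agent_heading, cell_size=1.0):
    """Bucket (position, feature) observations into top-down grid cells.

    The map is a dict keyed by cell index, so it grows as new regions are
    observed, and it is rebuilt relative to the agent's current pose.
    """
    cells = defaultdict(list)
    for xy, feat in observations:
        ex, ey = to_egocentric(xy, agent_xy, agent_heading)
        key = (math.floor(ex / cell_size), math.floor(ey / cell_size))
        cells[key].append(feat)
    return dict(cells)


def aggregate_cell(features, instruction):
    """Softmax-weighted average of a cell's features.

    Assumption: relevance is the dot product between each feature and an
    instruction embedding; the paper's actual scoring may differ.
    """
    scores = [sum(f * q for f, q in zip(feat, instruction)) for feat in features]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(features[0])
    return [
        sum(w * feat[i] for w, feat in zip(weights, features)) / total
        for i in range(dim)
    ]


# Toy data: three observations with 2-D features, agent at the origin facing +x.
observations = [
    ((2.5, 0.5), [1.0, 0.0]),
    ((2.7, 0.2), [0.0, 1.0]),
    ((-1.5, 3.0), [0.5, 0.5]),
]
grid = build_grid_map(observations, agent_xy=(0.0, 0.0), agent_heading=0.0)
instruction = [1.0, 0.0]  # made-up instruction embedding
agg = aggregate_cell(grid[(2, 0)], instruction)
```

In this toy run the first two observations fall into the same cell, and the aggregated feature leans toward the one that aligns with the instruction embedding. A real system would use learned visual and text features and depth-based projection instead of hand-written vectors.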
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Robotics and Sensor-Based Localization
