Mem4Nav: Boosting Vision-and-Language Navigation in Urban Environments with a Hierarchical Spatial-Cognition Long-Short Memory System

Lixuan He; Haoyu Dong; Zhenxing Chen; Yangcheng Yu; Jie Feng; Yong Li

arXiv:2506.19433·cs.CV·October 13, 2025

Open Access

TL;DR

Mem4Nav introduces a hierarchical memory system combining spatial and semantic information to significantly improve vision-and-language navigation in urban environments, enhancing long-term reasoning and real-time decision-making.

Contribution

It presents a novel hierarchical spatial-cognition long-short memory system that can augment any VLN backbone, integrating octree and semantic graph structures with trainable memory tokens.

Findings

01. 7-13 percentage point gains in Task Completion

02. Significant reduction in shortest path distance (SPD)

03. Over 10 percentage point improvement in normalized Dynamic Time Warping (nDTW)

Abstract

Vision-and-Language Navigation (VLN) in large-scale urban environments requires embodied agents to ground linguistic instructions in complex scenes and recall relevant experiences over extended time horizons. Prior modular pipelines offer interpretability but lack unified memory, while end-to-end (M)LLM agents excel at fusing vision and language yet remain constrained by fixed context windows and implicit spatial reasoning. We introduce Mem4Nav, a hierarchical spatial-cognition long-short memory system that can augment any VLN backbone. Mem4Nav fuses a sparse octree for fine-grained voxel indexing with a semantic topology graph for high-level landmark connectivity, storing both in trainable memory tokens embedded via a reversible Transformer. Long-term memory (LTM) compresses and retains historical observations at both octree and graph nodes, while short-term memory (STM)…
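To make the two spatial structures named in the abstract more concrete, the following is a minimal sketch, not the authors' implementation. It assumes a common way of realizing a sparse octree leaf index (a hash map keyed by Morton codes) and a plain adjacency map for the landmark graph. All names here (`morton3d`, `SparseVoxelIndex`, `LandmarkGraph`, `voxel_size`, `bits`) are hypothetical, coordinates are assumed non-negative, and the stored payloads are ordinary Python objects standing in for the trainable memory tokens the paper describes.

```python
from collections import defaultdict


def morton3d(x, y, z, bits=10):
    """Interleave the low `bits` bits of x, y, z into a single integer key."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (3 * i)
        code |= ((y >> i) & 1) << (3 * i + 1)
        code |= ((z >> i) & 1) << (3 * i + 2)
    return code


class SparseVoxelIndex:
    """Hypothetical sparse voxel store: only voxels that were written exist."""

    def __init__(self, voxel_size=1.0, bits=10):
        self.voxel_size = voxel_size
        self.bits = bits
        self.cells = {}  # Morton key -> list of payloads

    def key(self, position):
        # Quantize a metric (x, y, z) position to integer voxel coordinates.
        ix, iy, iz = (int(c // self.voxel_size) for c in position)
        return morton3d(ix, iy, iz, self.bits)

    def write(self, position, payload):
        self.cells.setdefault(self.key(position), []).append(payload)

    def read(self, position):
        return self.cells.get(self.key(position), [])


class LandmarkGraph:
    """Hypothetical undirected landmark graph with a cost on each edge."""

    def __init__(self):
        self.edges = defaultdict(dict)

    def connect(self, a, b, cost):
        self.edges[a][b] = cost
        self.edges[b][a] = cost

    def neighbors(self, a):
        return dict(self.edges[a])


index = SparseVoxelIndex(voxel_size=1.0)
index.write((2.5, 0.2, 7.9), "observation near the bakery")

graph = LandmarkGraph()
graph.connect("bakery", "intersection", cost=12.0)
```

In this sketch, a later query at any position inside the same voxel returns what was written there, while the graph answers coarser questions such as which landmarks are reachable from the current one. How Mem4Nav actually encodes, compresses, and retrieves its memory tokens is described in the paper itself and is not modeled here.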

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Speech and dialogue systems