MSNav: Zero-Shot Vision-and-Language Navigation with Dynamic Memory and LLM Spatial Reasoning

Chenghao Liu; Zhimu Zhou; Jiachen Zhang; Minghao Zhang; Songfang Huang; Huiling Duan

arXiv:2508.16654·cs.CV·September 11, 2025

TL;DR

MSNav is a robust framework for vision-and-language navigation that combines memory, spatial reasoning, and LLM-based decision-making, significantly improving navigation success rates in complex environments.

Contribution

The paper introduces an integrated architecture that combines dynamic memory, spatial reasoning, and LLM-based path planning, together with a new dataset and a fine-tuned LLM for spatial inference.

Findings

01

MSNav achieves state-of-the-art results on R2R and REVERIE datasets.

02

Qwen-Spatial outperforms other LLMs in object list extraction.

03

Memory module improves long-range exploration.

Abstract

Vision-and-Language Navigation (VLN) requires an agent to interpret natural language instructions and navigate complex environments. Current approaches often adopt a "black-box" paradigm in which a single Large Language Model (LLM) makes end-to-end decisions. This paradigm suffers from critical weaknesses, including poor spatial reasoning, weak cross-modal grounding, and memory overload in long-horizon tasks. To address these issues systematically, we propose Memory Spatial Navigation (MSNav), a framework that fuses three modules into one architecture to make inference more robust: the Memory Module, a dynamic map memory that tackles memory overload through selective node pruning and thereby improves long-range exploration; the Spatial Module, which handles spatial reasoning and object relationship inference…
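
The abstract describes the Memory Module only at a high level, so the following is a minimal sketch of what "selective node pruning" in a map memory could look like, not the authors' implementation. The class names, the capacity limit, the relevance score, and the age penalty of 0.1 per step are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class MapNode:
    node_id: str
    last_visit_step: int   # step at which the agent last observed this node
    relevance: float       # hypothetical instruction-relevance score in [0, 1]
    neighbors: set = field(default_factory=set)


class PrunedMapMemory:
    """Topological map that drops its lowest-scoring node when over capacity.

    Assumption: the scoring rule (relevance minus an age penalty) is
    illustrative only; the source does not specify the pruning criterion.
    """

    def __init__(self, capacity: int, age_penalty: float = 0.1):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self.age_penalty = age_penalty
        self.nodes = {}  # node_id -> MapNode

    def add(self, node_id, step, relevance, neighbors=()):
        node = self.nodes.get(node_id)
        if node is None:
            node = MapNode(node_id, step, relevance)
            self.nodes[node_id] = node
        else:
            # Revisiting a node refreshes its timestamp and keeps the best score.
            node.last_visit_step = step
            node.relevance = max(node.relevance, relevance)
        # Only link to nodes still held in memory, and keep edges symmetric.
        for other in neighbors:
            if other in self.nodes and other != node_id:
                node.neighbors.add(other)
                self.nodes[other].neighbors.add(node_id)
        self._prune(current=node_id, step=step)

    def _score(self, node, step):
        return node.relevance - self.age_penalty * (step - node.last_visit_step)

    def _prune(self, current, step):
        # Never prune the node the agent currently occupies.
        while len(self.nodes) > self.capacity:
            candidates = [n for n in self.nodes.values() if n.node_id != current]
            victim = min(candidates, key=lambda n: self._score(n, step))
            for other in victim.neighbors:
                self.nodes[other].neighbors.discard(victim.node_id)
            del self.nodes[victim.node_id]


memory = PrunedMapMemory(capacity=2)
memory.add("a", step=0, relevance=0.9)
memory.add("b", step=1, relevance=0.1, neighbors=["a"])
memory.add("c", step=2, relevance=0.5, neighbors=["b"])
remaining = sorted(memory.nodes)
```

In this run the low-relevance node "b" is the one removed once the third node arrives, so the memory retains "a" and "c". Bounding the map this way keeps the context passed to an LLM planner small, at the cost of possibly forgetting a node that later turns out to matter.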
