CMMR-VLN: Vision-and-Language Navigation via Continual Multimodal Memory Retrieval

Haozhou Li; Xiangyu Dong; Huiyan Jiang; Yaoming Zhou; Xiaoguang Ma

arXiv:2603.07997·cs.AI·March 10, 2026



TL;DR

CMMR-VLN enhances vision-and-language navigation by integrating structured multimodal memory and reflection mechanisms, significantly improving success rates in complex and unfamiliar environments compared to existing models.

Contribution

This work introduces a novel VLN framework with structured memory retrieval and reflection capabilities, enabling better recall and utilization of prior experiences during navigation.

Findings

01 Achieved up to 52.9% success rate improvement over NavGPT.

02 Demonstrated effective retrieval of relevant experiences during navigation.

03 Showed significant performance gains in both simulation and real-world tests.

Abstract

Although large language models (LLMs) have been introduced into vision-and-language navigation (VLN) to improve instruction comprehension and generalization, existing LLM-based VLN methods lack the ability to selectively recall and use relevant prior experiences to help navigation tasks, limiting their performance in long-horizon and unfamiliar scenarios. In this work, we propose CMMR-VLN (Continual Multimodal Memory Retrieval based VLN), a VLN framework that endows LLM agents with structured memory and reflection capabilities. Specifically, CMMR-VLN constructs a multimodal experience memory indexed by panoramic visual images and salient landmarks to retrieve relevant experiences during navigation, introduces a retrieval-augmented generation pipeline to mimic how experienced human navigators leverage prior knowledge, and incorporates a reflection-based memory update strategy that…
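
To make the retrieval idea concrete, here is a minimal sketch of a memory indexed by both an image feature and a set of landmarks. It is an illustration only, not the authors' implementation: the `Experience` record, the `retrieve` function, the cosine-plus-Jaccard score, and the `alpha` weight are all assumptions introduced here, since the abstract does not specify how the two index types are combined.

```python
from __future__ import annotations

import math
from dataclasses import dataclass


@dataclass
class Experience:
    """One stored episode (hypothetical record layout)."""
    image_embedding: list[float]  # assumed feature vector of a panoramic view
    landmarks: set[str]           # salient landmarks seen at that location
    note: str                     # lesson the agent recorded for later reuse


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def jaccard(a: set[str], b: set[str]) -> float:
    """Overlap of two landmark sets; returns 0.0 if both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def retrieve(memory: list[Experience], query_embedding: list[float],
             query_landmarks: set[str], k: int = 1,
             alpha: float = 0.5) -> list[Experience]:
    """Return the k experiences with the highest combined score.

    The score is an assumed weighted sum: alpha * visual similarity
    plus (1 - alpha) * landmark overlap.
    """
    def score(exp: Experience) -> float:
        visual = cosine(exp.image_embedding, query_embedding)
        overlap = jaccard(exp.landmarks, query_landmarks)
        return alpha * visual + (1 - alpha) * overlap

    return sorted(memory, key=score, reverse=True)[:k]
```

In a retrieval-augmented pipeline of the kind the abstract describes, the `note` fields of the returned experiences would be added to the LLM prompt before the agent chooses its next action; how the paper formats that prompt is not stated in the text above.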


Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Natural Language Processing Techniques