SE-VLN: A Self-Evolving Vision-Language Navigation Framework Based on Multimodal Large Language Models

Xiangyu Dong; Haoran Zhao; Jiang Gao; Haozhou Li; Xiaoguang Ma; Yaoming Zhou; Fuhai Chen; Juan Liu

arXiv:2507.13152·cs.CV·August 27, 2025

Open Access · 5 Reviews

TL;DR

SE-VLN introduces a novel self-evolving framework for vision-language navigation that enables agents to continually learn and improve during testing by leveraging multimodal large language models and experience-based modules.

Contribution

This work is the first to propose a multimodal LLM-powered self-evolving VLN framework, enhancing navigation performance through continual evolution during testing.

Findings

01

Achieved a 57% success rate in unseen environments on the R2R dataset.

02

Improved performance by 23.9% (relative gain) over the state of the art on R2R.

03

Performance increased with more experience, showing effective continual learning.

Abstract

Recent advances in vision-language navigation (VLN) were mainly attributed to emerging large language models (LLMs). These methods exhibited excellent generalization capabilities in instruction understanding and task reasoning. However, they were constrained by the fixed knowledge bases and reasoning abilities of LLMs, which prevented them from fully incorporating experiential knowledge and thus resulted in a lack of efficient evolutionary capacity. To address this, we drew inspiration from the evolution capabilities of natural agents and proposed a self-evolving VLN framework (SE-VLN) to endow VLN agents with the ability to continuously evolve during testing. To the best of our knowledge, it was the first time that a multimodal LLM-powered self-evolving VLN framework was proposed. Specifically, SE-VLN comprised three core modules, i.e., a hierarchical memory module to transfer successful and…

Peer Reviews

Decision·ICLR 2026 Conference Withdrawn Submission

Reviewer 01 · Rating 4 · Confidence 3

Strengths

1. Introduces a training-free, self-evolving VLN framework that continuously improves during deployment, a relatively unexplored capability in LLM-powered agents.
2. Well-structured architecture combining hierarchical memory, retrieval-augmented reasoning, and reflection aligns closely with human-like learning and adaptability.
3. Comprehensive experimental evaluation on multiple datasets (R2R, REVERIE) and with diverse MLLMs, demonstrating gains over prior methods.
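The combination of memory, retrieval, and reflection named in the strengths above can be pictured as a test-time loop. The following is only a minimal sketch under stated assumptions, not the authors' implementation: the names `Experience`, `ExperienceRepository`, `run_episode`, the token-overlap retriever, and the toy `policy` and `evaluate` callables are all hypothetical stand-ins (the real system would call an MLLM and a simulator-based evaluator).

```python
from dataclasses import dataclass, field


@dataclass
class Experience:
    instruction: str  # instruction of a past episode
    lesson: str       # text distilled after the episode (hypothetical format)
    success: bool     # outcome reported by the evaluator


def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity; an assumed stand-in for the real retriever."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


@dataclass
class ExperienceRepository:
    items: list = field(default_factory=list)

    def retrieve(self, instruction: str, k: int = 2) -> list:
        # Rank stored experiences by similarity to the new instruction.
        ranked = sorted(
            self.items,
            key=lambda e: jaccard(e.instruction, instruction),
            reverse=True,
        )
        return ranked[:k]

    def add(self, exp: Experience) -> None:
        self.items.append(exp)


def run_episode(instruction, repo, policy, evaluate):
    """One test-time episode: retrieve, act, evaluate, reflect, store."""
    hints = [e.lesson for e in repo.retrieve(instruction)]
    trajectory = policy(instruction, hints)  # placeholder for the MLLM call
    success = evaluate(trajectory)           # placeholder outcome evaluator
    prefix = "worked: " if success else "failed: "
    repo.add(Experience(instruction, prefix + " -> ".join(trajectory), success))
    return success


# Toy usage: the hypothetical policy only reaches the goal once it has hints.
repo = ExperienceRepository()
policy = lambda instr, hints: ["hall", "kitchen"] if hints else ["hall"]
evaluate = lambda traj: traj[-1] == "kitchen"

first = run_episode("walk to the kitchen", repo, policy, evaluate)
second = run_episode("go to the kitchen", repo, policy, evaluate)
print(first, second, len(repo.items))
```

In this toy setup the repository is the only state that changes between episodes, which is what makes the loop training-free. The `evaluate` callable is where an outcome signal would enter in any real deployment.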

Weaknesses

1. The framework is complex, with multiple handcrafted components, which may reduce accessibility and reproducibility for the broader community.
2. Heavy reliance on specific LLM capabilities (e.g., GPT-4o) risks reduced generality if applied to smaller or weaker models.
3. Limited discussion on computational cost, scalability, and resource requirements for maintaining large experience repositories during test-time deployment.
4. Evaluation is confined to indoor Matterport3D environments;

Reviewer 02 · Rating 2 · Confidence 4

Strengths

1. The framework's primary strength is its closed-loop, self-evolving process. Unlike methods that rely on static datasets, SE-VLN actively learns from its mistakes.
2. The organization of this paper is logical.

Weaknesses

1. The paper's core motivation is to emulate the "autonomous evolution" of "natural agents" (e.g., horses, migratory birds) to overcome "data dependency". However, the proposed method's "Reflection Module" is entirely dependent on an omniscient oracle. The "Outcome Evaluator" requires "ground truth data from the MatterPort3D simulator", i.e., $\tau_{gt}$ to calculate metrics and identify failure. This is a form of strong supervision, not the autonomous, experience-driven adaptation of a "natural

Reviewer 03 · Rating 4 · Confidence 5

Strengths

1. **Well-Presented and Clear**: The paper is written in a clear and accessible manner, ensuring that readers can easily understand, reproduce, and validate the proposed methods.
2. **Performance Improvements**: The framework achieves significant performance improvements on offline VLN environments.

Weaknesses

1. **Incremental Contribution**: The proposed framework combines existing techniques like retrieval-augmented generation (RAG), chain-of-thought (CoT), and reflection mechanisms to tackle VLN tasks on offline datasets. While it achieves performance improvements, the approach feels incremental and lacks distinctive insights or novel designs, falling short of the quality standards typically expected at conferences like ICLR.
2. **Unfair Comparisons**: Since zero-shot methods in this paper rely

Reviewer 04 · Rating 4 · Confidence 4

Strengths

1. As a training-free method, it leverages the capabilities of a pre-trained MLLM, avoiding the need for large-scale annotated data and the complexity of model fine-tuning iterations.
2. The experimental section is comprehensive, covering mainstream datasets (R2R, REVERIE), multiple MLLM base models, and a crucial evaluation of the self-evolution capability.

Weaknesses

1. While the concept of "self-evolution" is appealing, its implementation mechanism "Reflection + Memory Update" is not entirely novel within the field of LLM-based Agents. Numerous existing works have explored continuous learning and policy optimization for LLM agents through reflection and memory mechanisms. The paper needs to more clearly delineate its specific innovations compared to this established body of work.
2. The RAG strategy appears oversimplified. The Experience Retriever relies

Reviewer 05 · Rating 4 · Confidence 4

Strengths

- **Quality**: The experimental evaluation is thorough, with comprehensive results on two standard VLN benchmarks (R2R and REVERIE). The performance improvements (23.9% and 15.0% relative gains) are statistically significant and well-documented. The framework's ability to improve with increasing experience repository size provides strong evidence of its self-evolving nature.
- **Clarity**: The paper is well-structured and clearly written. The three core modules are logically explained, and the w

Weaknesses

- **Ambiguity around the “training-free” claim**: The paper prominently claims SE-VLN is “training-free”, but this assertion lacks sufficient clarification. Specifically, it is unclear how the experience repository and other modules are initialized. If the repository is constructed from prior navigation episodes on the R2R or REVERIE datasets (the same benchmarks used for evaluation), then the reported performance gains may partly reflect data leakage or overfitting to those environments, rather

Taxonomy

Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Advanced Neural Network Applications