VLN-MME: Diagnosing MLLMs as Language-guided Visual Navigation agents
Xunyi Zhao, Gengze Zhou, Qi Wu

TL;DR
This paper introduces VLN-MME, a modular evaluation framework for assessing multimodal large language models as zero-shot visual navigation agents, revealing their limitations in spatial reasoning and sequential decision-making.
Contribution
It presents a unified, extensible benchmark for evaluating MLLMs in embodied navigation, enabling detailed analysis and revealing their poor spatial reasoning capabilities.
Findings
Enhancing the baseline agent with Chain-of-Thought and reflection-based reasoning decreases navigation performance rather than improving it.
MLLMs show poor context awareness in 3D spatial reasoning tasks.
The framework facilitates structured comparisons across diverse models and tasks.
Abstract
Multimodal Large Language Models (MLLMs) have demonstrated remarkable capabilities across a wide range of vision-language tasks. However, their performance as embodied agents, which requires multi-round dialogue, spatial reasoning, and sequential action prediction, needs further exploration. Our work investigates this potential in the context of Vision-and-Language Navigation (VLN) by introducing a unified and extensible evaluation framework that probes MLLMs as zero-shot agents and bridges traditional navigation datasets into a standardized benchmark, named VLN-MME. We simplify the evaluation with a highly modular and accessible design. This flexibility streamlines experiments, enabling structured comparisons and component-level ablations across diverse MLLM architectures, agent designs, and navigation tasks. Crucially, enabled by our framework, we observe that enhancing our baseline agent…
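To make the idea of a simulator-free, modular evaluation more concrete, here is a minimal sketch of how an episode might be rolled out over pre-rendered observations. It is not the VLN-MME code or API: the names `Viewpoint`, `Episode`, `run_episode`, and `scripted_agent` are hypothetical, the image paths are placeholders, and success is simplified to stopping exactly on the goal viewpoint. The agent is just a callable, so a real MLLM wrapper could be swapped in without touching the loop.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

STOP = "STOP"  # hypothetical action name for ending an episode


@dataclass
class Viewpoint:
    image_path: str        # pre-rendered observation on disk (placeholder path)
    neighbors: List[str]   # ids of navigable adjacent viewpoints


@dataclass
class Episode:
    instruction: str
    start: str
    goal: str
    max_steps: int = 10


# An agent maps (instruction, current view, visited ids) to the next viewpoint id or STOP.
Agent = Callable[[str, Viewpoint, List[str]], str]


def run_episode(graph: Dict[str, Viewpoint], episode: Episode, agent: Agent) -> bool:
    """Roll out one episode on a pre-rendered graph; no simulator is involved."""
    current = episode.start
    history: List[str] = []
    for _ in range(episode.max_steps):
        view = graph[current]
        action = agent(episode.instruction, view, history)
        # Assumption: an invalid action is treated the same as stopping.
        if action == STOP or action not in view.neighbors:
            break
        history.append(action)
        current = action
    # Simplified success criterion: the agent ended on the goal viewpoint.
    return current == episode.goal


def scripted_agent(path: List[str]) -> Agent:
    """Stand-in for an MLLM: replays a fixed list of viewpoint ids, then stops."""
    steps = iter(path)

    def act(instruction: str, view: Viewpoint, history: List[str]) -> str:
        return next(steps, STOP)

    return act


if __name__ == "__main__":
    graph = {
        "a": Viewpoint("a.jpg", ["b"]),
        "b": Viewpoint("b.jpg", ["a", "c"]),
        "c": Viewpoint("c.jpg", ["b"]),
    }
    episode = Episode("Walk past the sofa and stop at the door.", "a", "c")
    print(run_episode(graph, episode, scripted_agent(["b", "c"])))
```

Because the environment is only a lookup table of stored views and adjacency lists, swapping the model, the prompt strategy, or the dataset amounts to changing one argument, which is the kind of component-level ablation the abstract describes.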
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
1. **The paper is well-written.** The proposed method is well-illustrated and easy to follow. 2. **The simulator-free approach, which pre-renders observations and metadata, is evaluation-friendly.** It lowers the computational barrier to entry for VLN research. 3. **Insightful findings:** The central finding—that CoT and reflection-based reasoning harm performance—is counter-intuitive and impactful. It challenges the prevailing assumption that such techniques are universally beneficial and for
1. **Limited scale of models tested:** The experiments are conducted on MLLMs in the 7B-8B parameter range. While this is representative of current open-source models, the conclusions about the failure of CoT and fundamental reasoning limitations might not generalize to significantly larger, more powerful models (e.g., GPT-4o, Gemini 2.5 Pro). 2. **Simulator-free design is unfavorable for video-based models:** The simulator-free design, while efficient, inherently limits the scope of models th
1. The paper proposes a well-conceived and extensible evaluation suite for VLN tasks, which is a practical tool for the community. The modularity is clearly described, and the simulator-free design reduces computational burden, lowering the barrier to entry for benchmarking and reproducibility. 2. The paper is well structured and easy to follow. The benchmark design, experiments, and analysis are presented in a clear and logical way.
1. The method section focuses on modular and simulator-free design. While I acknowledge the design efforts, I feel they are more like engineering work than scientifically driven research. VLN tasks have been deeply studied. I appreciate that the authors probably wrapped the evaluations into easy-to-use APIs, but I feel that this work is not substantially different in identifying the key capabilities of VLMs compared to other existing VLN benchmarks. 2. The authors use four pre-rendered, n
* Studying MLLMs as embodied agents for language-guided visual navigation is an important research direction. * The proposed framework provides a practical and effective way to evaluate MLLM-based agents. * Its simulator-free design significantly reduces the computational cost of simulation.
* Overall, the paper’s contribution is quite limited. Evaluation of MLLM-based navigation agents already appears in EmbodiedBench [1] and EmbodiedEval [2], which cover diverse scenes, instructions, and difficulty levels, and already support modular MLLM evaluation in EmbodiedBench. The paper does not clearly differentiate itself from these closely related efforts, so the true contribution is unclear and quite limited to my understanding. * The paper uses three datasets, R2R, REVERIE, and Object
- simulation/rendering free evaluation - baseline agents - fine-grained metrics
- limited set of tasks. Could include outdoor VLN [1,2] for an additional distinct setting. [1] Touchdown: Natural Language Navigation and Spatial Reasoning in Visual Street Environments, Chen et al., 2018 [2] VELMA: Verbalization Embodiment of LLM Agents for Vision and Language Navigation in Street View, Schumann et al., 2024
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Speech and dialogue systems
