One Agent to Guide Them All: Empowering MLLMs for Vision-and-Language Navigation via Explicit World Representation

Zerui Li; Hongpei Zheng; Fangguo Zhao; Aidan Chan; Jian Zhou; Sihao Lin; Shijie Li; Qi Wu

arXiv:2602.15400 · cs.RO · February 18, 2026

TL;DR

This paper introduces a decoupled framework for vision-and-language navigation that uses an interactive metric world representation with MLLMs, achieving state-of-the-art zero-shot performance and effective sim-to-real transfer across different robotic platforms.

Contribution

It proposes a novel decoupled design with an interactive metric world representation and counterfactual reasoning, enhancing MLLMs' decision-making in navigation tasks.

Findings

01 Achieved a 48.8% success rate (SR) on the R2R-CE benchmark.

02 Achieved a 42.2% success rate (SR) on the RxR-CE benchmark.

03 Demonstrated successful zero-shot sim-to-real transfer on diverse robots.

Abstract

A navigation agent needs to understand both high-level semantic instructions and precise spatial perception. Building navigation agents around Multimodal Large Language Models (MLLMs) is a promising approach because of their strong generalization ability. However, the current tightly coupled design severely limits system performance. In this work, we propose a decoupled design that separates low-level spatial state estimation from high-level semantic planning. Unlike previous methods that rely on predefined, oversimplified textual maps, we introduce an interactive metric world representation that maintains rich and consistent information, allowing MLLMs to interact with it and reason over it for decision-making. Furthermore, counterfactual reasoning is introduced to further elicit MLLMs' capacity, while the metric world representation ensures the physical validity of the…
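To make the decoupling described in the abstract concrete, here is a minimal sketch under stated assumptions, not the authors' implementation. It assumes the metric world representation is a small 2D occupancy grid and that the planner is any callable returning ranked candidate waypoints (a stand-in for an MLLM call). The names `MetricWorld`, `choose_waypoint`, and `toy_planner` are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple

Cell = Tuple[int, int]  # (row, col) index into the occupancy grid


@dataclass
class MetricWorld:
    """Hypothetical metric map: a 2D occupancy grid with a known cell size."""
    occupied: List[List[bool]]
    cell_size_m: float = 0.5
    landmarks: Dict[str, Cell] = field(default_factory=dict)

    def is_free(self, cell: Cell) -> bool:
        """True if the cell lies inside the grid and is not occupied."""
        r, c = cell
        in_bounds = 0 <= r < len(self.occupied) and 0 <= c < len(self.occupied[0])
        return in_bounds and not self.occupied[r][c]

    def distance_m(self, a: Cell, b: Cell) -> float:
        """Straight-line distance between two cells, in metres."""
        return self.cell_size_m * ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5

    def describe(self, pose: Cell) -> str:
        """Text summary of landmark distances that a planner could query."""
        parts = [
            f"{name} at {self.distance_m(pose, cell):.1f} m"
            for name, cell in sorted(self.landmarks.items())
        ]
        return "; ".join(parts)


# A planner maps (instruction, map summary) to ranked candidate waypoints.
Planner = Callable[[str, str], List[Cell]]


def choose_waypoint(world: MetricWorld, pose: Cell, instruction: str,
                    planner: Planner) -> Optional[Cell]:
    """Semantic planning proposes; the metric map rejects invalid proposals."""
    summary = world.describe(pose)
    for candidate in planner(instruction, summary):
        if world.is_free(candidate):
            return candidate
    return None


def toy_planner(instruction: str, summary: str) -> List[Cell]:
    # Stand-in for an MLLM call (assumption): the first candidate is a wall cell.
    return [(0, 1), (1, 2)]


grid = [[False, True, False],
        [False, False, False]]
world = MetricWorld(occupied=grid, landmarks={"door": (1, 2)})
chosen = choose_waypoint(world, (0, 0), "go to the door", toy_planner)
```

In this toy setup the planner never touches raw geometry and the map never interprets language. The only coupling is the text summary passed in one direction and the validity check applied in the other, which is one simple way to read the separation the abstract describes.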

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Spatial Cognition and Navigation