One Agent to Guide Them All: Empowering MLLMs for Vision-and-Language Navigation via Explicit World Representation
Zerui Li, Hongpei Zheng, Fangguo Zhao, Aidan Chan, Jian Zhou, Sihao Lin, Shijie Li, Qi Wu

TL;DR
This paper introduces a decoupled framework for vision-and-language navigation that uses an interactive metric world representation with MLLMs, achieving state-of-the-art zero-shot performance and effective sim-to-real transfer across different robotic platforms.
Contribution
It proposes a novel decoupled design with an interactive metric world representation and counterfactual reasoning, enhancing MLLMs' decision-making in navigation tasks.
Findings
Achieved a 48.8% success rate (SR) on the R2R-CE benchmark.
Achieved a 42.2% SR on the RxR-CE benchmark.
Demonstrated successful zero-shot sim-to-real transfer on diverse robots.
Abstract
A navigation agent needs to understand both high-level semantic instructions and precise spatial perception. Building navigation agents around Multimodal Large Language Models (MLLMs) is a promising approach because of their strong generalization ability. However, the current tightly coupled design severely limits system performance. In this work, we propose a decoupled design that separates low-level spatial state estimation from high-level semantic planning. Unlike previous methods that rely on predefined, oversimplified textual maps, we introduce an interactive metric world representation that maintains rich and consistent information, allowing MLLMs to interact with and reason on it for decision-making. Furthermore, counterfactual reasoning is introduced to further elicit MLLMs' capacity, while the metric world representation ensures the physical validity of the…
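The decoupling described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `MetricWorld` class, the 2D occupancy grid, the 0.5 m cell size, and the `choose_waypoint` helper are all hypothetical names and assumptions chosen for illustration. The idea shown is only that a high-level planner (here a plain list of preferred waypoints standing in for an MLLM's proposals) can query a separate metric map, which rejects physically invalid choices.

```python
from dataclasses import dataclass, field


@dataclass
class MetricWorld:
    """Hypothetical 2D occupancy grid standing in for a metric world representation."""
    width: int                 # grid width in cells
    height: int                # grid height in cells
    resolution: float = 0.5    # metres per cell (assumed value)
    occupied: set = field(default_factory=set)

    def to_cell(self, x: float, y: float) -> tuple:
        # Convert metric coordinates to integer grid indices.
        return (int(x // self.resolution), int(y // self.resolution))

    def mark_obstacle(self, x: float, y: float) -> None:
        # Low-level state estimation would call this as obstacles are observed.
        self.occupied.add(self.to_cell(x, y))

    def is_free(self, x: float, y: float) -> bool:
        # A waypoint is valid only if it lies inside the map and is unoccupied.
        cx, cy = self.to_cell(x, y)
        in_bounds = 0 <= cx < self.width and 0 <= cy < self.height
        return in_bounds and (cx, cy) not in self.occupied


def choose_waypoint(world: MetricWorld, candidates: list):
    """Return the first candidate, in planner preference order, that the map accepts."""
    for x, y in candidates:
        if world.is_free(x, y):
            return (x, y)
    return None  # no physically valid proposal


world = MetricWorld(width=10, height=10)
world.mark_obstacle(1.0, 1.0)
# The planner's proposals, most preferred first: blocked, out of bounds, then valid.
proposals = [(1.2, 1.1), (-1.0, 0.0), (2.0, 3.0)]
selected = choose_waypoint(world, proposals)
```

In this toy setup the first two proposals are rejected by the map and the third is selected, so the semantic planner never has to reason about collision geometry itself.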
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Spatial Cognition and Navigation
