NavBench: Probing Multimodal Large Language Models for Embodied Navigation

Yanyuan Qiao; Haodong Hong; Wenqi Lyu; Dong An; Siqi Zhang; Yutong Xie; Xinyu Wang; Qi Wu

arXiv:2506.01031·cs.CV·June 3, 2025

Open Access

TL;DR

This paper introduces NavBench, a comprehensive benchmark for evaluating the zero-shot embodied navigation capabilities of multimodal large language models, highlighting their strengths and limitations in understanding and acting in indoor environments.

Contribution

NavBench provides a new standardized evaluation framework, with diverse tasks and a real-world robotic deployment pipeline, for assessing MLLMs in embodied navigation.

Findings

01. GPT-4o performs well across tasks

02. Open-source models succeed in simpler cases

03. Higher comprehension scores correlate with better execution

Abstract

Multimodal Large Language Models (MLLMs) have demonstrated strong generalization in vision-language tasks, yet their ability to understand and act within embodied environments remains underexplored. We present NavBench, a benchmark to evaluate the embodied navigation capabilities of MLLMs under zero-shot settings. NavBench consists of two components: (1) navigation comprehension, assessed through three cognitively grounded tasks including global instruction alignment, temporal progress estimation, and local observation-action reasoning, covering 3,200 question-answer pairs; and (2) step-by-step execution in 432 episodes across 72 indoor scenes, stratified by spatial, cognitive, and execution complexity. To support real-world deployment, we introduce a pipeline that converts MLLMs' outputs into robotic actions. We evaluate both proprietary and open-source models, finding that GPT-4o…
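
The abstract mentions a pipeline that converts MLLMs' outputs into robotic actions but does not describe it here. As a rough illustration only, the sketch below maps a model's free-text reply to one discrete action. The action vocabulary, the phrase patterns, and the function name `parse_action` are all assumptions made for this example and are not taken from the paper.

```python
import re
from typing import Optional

# Hypothetical phrase patterns for a hypothetical four-action vocabulary.
# The paper's real action space and parsing rules may differ.
_PATTERNS = {
    "stop": r"\b(stop|halt|arrived)\b",
    "turn_left": r"\b(turn|go|rotate)\s+left\b",
    "turn_right": r"\b(turn|go|rotate)\s+right\b",
    "forward": r"\b(forward|straight|ahead)\b",
}


def parse_action(reply: str) -> Optional[str]:
    """Return the action whose phrase appears earliest in the reply, or None."""
    text = reply.lower()
    best = None  # (match position, action name) of the earliest match so far
    for action, pattern in _PATTERNS.items():
        match = re.search(pattern, text)
        if match and (best is None or match.start() < best[0]):
            best = (match.start(), action)
    return best[1] if best else None


if __name__ == "__main__":
    for reply in ("I will turn left at the sofa.",
                  "Go straight ahead, then stop.",
                  "I am not sure."):
        print(repr(reply), "->", parse_action(reply))
```

Taking the earliest matching phrase is one simple design choice: a reply that lists several steps yields only its first step, so a step-by-step controller could execute it and then query the model again. Replies with no recognizable phrase return `None`, leaving the caller to decide on a fallback.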

Taxonomy

Topics: Natural Language Processing Techniques · Speech and dialogue systems · Topic Modeling