City Navigation in the Wild: Exploring Emergent Navigation from Web-Scale Knowledge in MLLMs
Dwip Dalal, Utkarsh Mishra, Narendra Ahuja, Nebojsa Jojic

TL;DR
This paper introduces CityNav, a benchmark for evaluating multimodal large language models (MLLMs) on real-world city navigation, highlights their current limitations, and proposes Verbalization of Path, a reasoning-based method for improving their navigation performance.
Contribution
The paper presents CityNav, a novel city navigation benchmark for MLLMs, and proposes Verbalization of Path to improve their reasoning and navigation performance in complex environments.
Findings
State-of-the-art MLLMs underperform on CityNav
Reasoning techniques like chain-of-thought improve performance
Verbalization of Path significantly enhances navigation success
Abstract
Leveraging multimodal large language models (MLLMs) to develop embodied agents offers significant promise for addressing complex real-world tasks. However, current evaluation benchmarks remain predominantly language-centric or heavily reliant on simulated environments, rarely probing the nuanced, knowledge-intensive reasoning essential for practical, real-world scenarios. To bridge this critical gap, we introduce the task of Sparsely Grounded Visual Navigation, explicitly designed to evaluate the sequential decision-making abilities of MLLMs in challenging, knowledge-intensive real-world environments. We operationalize this task with CityNav, a comprehensive benchmark encompassing four diverse global cities, specifically constructed to assess raw MLLM-driven agents in city navigation. Agents are required to rely solely on visual inputs and internal multimodal reasoning to sequentially…
Taxonomy
Topics: Multimodal Machine Learning Applications · Spatial Cognition and Navigation · Constraint Satisfaction and Optimization
