How Far Are Large Multimodal Models from Human-Level Spatial Action? A Benchmark for Goal-Oriented Embodied Navigation in Urban Airspace

Baining Zhao; Ziyou Wang; Jianjie Fang; Zile Zhou; Yanggang Xu; Yatai Ji; Jiacheng Xu; Qian Zhang; Weichen Zhang; Chen Gao; Xinlei Chen

arXiv:2604.07973·cs.AI·April 10, 2026

TL;DR

This paper evaluates large multimodal models' ability to perform goal-oriented spatial navigation in urban airspace, revealing current limitations and proposing directions for enhancement.

Contribution

The paper introduces a benchmark dataset of 5,037 goal-oriented navigation samples for urban airspace and uses it to assess 17 representative models, documenting both their emerging action capabilities and where they fall short.

Findings

01

LMMs show some action capabilities but are far from human-level performance.

02

Navigation errors tend to diverge rapidly after a decision bifurcation.

03

Analysis of behavior at critical decision points reveals key limitations.
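
One way to picture the second finding is to compare a model's path with a reference path step by step and look for the point where the two split. The sketch below is only an illustration under stated assumptions: the function names, the threshold, and the toy waypoints are all hypothetical and are not taken from the paper or its benchmark code.

```python
import math

def stepwise_deviation(pred, ref):
    """Euclidean distance between matching waypoints of two 3D paths."""
    return [math.dist(p, r) for p, r in zip(pred, ref)]

def first_divergence(deviations, threshold):
    """Index of the first step whose deviation exceeds the threshold, else None."""
    for i, d in enumerate(deviations):
        if d > threshold:
            return i
    return None

# Toy waypoints (made up): (x, y, altitude). The paths agree for three steps,
# then the predicted path takes a different branch and drifts away.
ref = [(0, 0, 10), (1, 0, 10), (2, 0, 10), (3, 0, 10), (4, 0, 10), (5, 0, 10)]
pred = [(0, 0, 10), (1, 0, 10), (2, 0, 10), (3, 1, 10), (4, 3, 10), (5, 6, 10)]

dev = stepwise_deviation(pred, ref)
split = first_divergence(dev, threshold=0.5)  # threshold is an arbitrary choice

print("per-step deviation:", dev)
print("first step past threshold:", split)
```

In this toy case the deviation stays at zero until the paths split and then grows by a larger amount at every step, which is the qualitative pattern the finding describes. A real analysis would need the benchmark's own trajectory format and metric definitions.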

Abstract

Large multimodal models (LMMs) show strong visual-linguistic reasoning, but their capacity for spatial decision-making and action remains unclear. In this work, we investigate whether LMMs can achieve embodied spatial action like humans through a challenging scenario: goal-oriented navigation in urban 3D spaces. We first spend over 500 hours constructing a dataset comprising 5,037 high-quality goal-oriented navigation samples, with an emphasis on 3D vertical actions and rich urban semantic information. Then, we comprehensively assess 17 representative models, including non-reasoning LMMs, reasoning LMMs, agent-based methods, and vision-language-action models. Experiments show that current LMMs exhibit emerging action capabilities, yet remain far from human-level performance. Furthermore, we reveal an intriguing phenomenon: navigation errors do not accumulate linearly but instead diverge…

Code & Models

Repositories

serenditipy-AC/Embodied-Navigation-Bench
github

Datasets

EmbodiedCity/EmbodiedNav-Bench
dataset · 436 dl