SPAN-Nav: Generalized Spatial Awareness for Versatile Vision-Language Navigation

Jiahang Liu; Tianyu Xu; Jiawei Chen; Lu Yue; Jiazhao Zhang; Zhiyong Wang; Minghan Li; Qisheng Zhao; Anqi Li; Qi Su; Zhizheng Zhang; He Wang

arXiv:2603.09163 · cs.RO · March 11, 2026

Open Access

TL;DR

SPAN-Nav introduces a universal 3D spatial awareness model for vision-language navigation, leveraging a compact spatial prior representation and multi-task training to improve generalization in complex environments.

Contribution

The paper presents SPAN-Nav, an end-to-end foundation model that gives embodied navigation universal 3D spatial awareness by compressing spatial priors, learned from extensive occupancy data, into a single spatial token.

Findings

01. Achieves state-of-the-art results across multiple navigation benchmarks.

02. Demonstrates robust generalization in complex real-world scenarios.

03. Uses a dataset of 4.2 million occupancy annotations.

Abstract

Recent embodied navigation approaches leveraging Vision-Language Models (VLMs) demonstrate strong generalization in versatile Vision-Language Navigation (VLN). However, reliable path planning in complex environments remains challenging due to insufficient spatial awareness. In this work, we introduce SPAN-Nav, an end-to-end foundation model designed to infuse embodied navigation with universal 3D spatial awareness using RGB video streams. SPAN-Nav extracts spatial priors across diverse scenes through an occupancy prediction task on extensive indoor and outdoor environments. To mitigate the computational burden, we introduce a compact representation for spatial priors, finding that a single token is sufficient to encapsulate the coarse-grained cues essential for navigation tasks. Furthermore, inspired by the Chain-of-Thought (CoT) mechanism, SPAN-Nav utilizes this single spatial token to…
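
As a rough illustration of what compressing spatial cues into a single token could look like, the sketch below attention-pools a set of per-patch feature vectors into one vector using a query vector. This is an assumption made for illustration only: the abstract does not say how SPAN-Nav produces its spatial token, and the function name `pool_to_single_token`, the feature dimension, the number of patches, and the random stand-in for a learned query are all hypothetical.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def pool_to_single_token(features, query):
    """Attention-pool N feature vectors (each of length d) into one vector.

    Hypothetical helper, not taken from the paper: scores each feature
    against the query with a scaled dot product, then returns the
    softmax-weighted average of the features.
    """
    d = len(query)
    scores = [
        sum(q * f for q, f in zip(query, feat)) / math.sqrt(d)
        for feat in features
    ]
    weights = softmax(scores)
    token = [
        sum(w * feat[i] for w, feat in zip(weights, features))
        for i in range(d)
    ]
    return token, weights


random.seed(0)
d = 8            # assumed feature dimension
num_patches = 16  # assumed number of visual patch features

# Random stand-ins for visual features and for a learned query vector.
features = [[random.gauss(0, 1) for _ in range(d)] for _ in range(num_patches)]
query = [random.gauss(0, 1) for _ in range(d)]

spatial_token, weights = pool_to_single_token(features, query)
print(len(spatial_token), len(weights))
```

Whatever the number of input patches, the output is a single vector of length `d`, which is the property the abstract's "single token" claim relies on to keep the computational cost low.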

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Robotics and Sensor-Based Localization