NavOne: One-Step Global Planning for Vision-Language Navigation on Top-Down Maps
Dijia Zhan, Jinyi Li, Chenxi Zheng, Shaoyu Huang, Yong Li, Jie Tang, Xuemiao Xu

TL;DR
NavOne introduces a one-step global planning approach for vision-language navigation on top-down maps, significantly improving efficiency and accuracy over previous step-by-step methods.
Contribution
The paper presents NavOne, a novel end-to-end framework for direct dense path prediction in top-down maps, advancing global spatial reasoning in VLN.
Findings
NavOne achieves state-of-the-art performance on the R2R-TopDown dataset.
It provides an 8x speedup over existing map-based baselines.
Its planning is 80x faster than that of egocentric methods.
Abstract
Existing Vision-Language Navigation (VLN) methods typically adopt an egocentric, step-by-step paradigm, which struggles with error accumulation and limits efficiency. While recent approaches attempt to leverage pre-built environment maps, they often rely on incrementally updating memory graphs or scoring discrete path proposals, which restricts continuous spatial reasoning and creates discrete bottlenecks. We propose Top-Down VLN (TD-VLN), reformulating navigation as a one-step global path planning problem on pre-built top-down maps, supported by our newly constructed R2R-TopDown dataset. To solve this, we introduce NavOne, a unified framework that directly predicts dense path probabilities over multi-modal maps in a single end-to-end forward pass. NavOne features a Top-Down Map Fuser for joint multi-modal map representation, and extends Attention Residuals for spatial-aware depth…
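To make "dense path probabilities over a map" concrete, here is a minimal sketch of how a per-cell probability grid could be turned into an ordered path. It is not taken from the paper: the abstract does not say how NavOne decodes its output, so the function name `decode_path`, the 4-connected grid, the known start and goal cells, and the choice of Dijkstra search over negative log-probabilities are all assumptions made for illustration.

```python
import heapq
import math


def decode_path(prob_map, start, goal, eps=1e-6):
    """Hypothetical decoder: find the 4-connected cell path from start to goal
    that minimizes the sum of -log(p) over the cells it enters, which is the
    same as maximizing the product of their probabilities.

    prob_map: list of rows, each a list of floats in [0, 1] (assumed format).
    start, goal: (row, col) tuples, assumed known and inside the grid.
    """
    rows, cols = len(prob_map), len(prob_map[0])
    dist = {start: 0.0}   # best known cost to each visited cell
    prev = {}             # back-pointers for path reconstruction
    heap = [(0.0, start)]

    while heap:
        d, cell = heapq.heappop(heap)
        if cell == goal:
            break
        if d > dist.get(cell, math.inf):
            continue  # stale heap entry
        r, c = cell
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if 0 <= nr < rows and 0 <= nc < cols:
                # Clamp with eps so zero-probability cells have finite cost.
                nd = d - math.log(max(prob_map[nr][nc], eps))
                if nd < dist.get((nr, nc), math.inf):
                    dist[(nr, nc)] = nd
                    prev[(nr, nc)] = cell
                    heapq.heappush(heap, (nd, (nr, nc)))

    if goal not in dist:
        return []

    # Walk the back-pointers from goal to start, then reverse.
    path = [goal]
    while path[-1] != start:
        path.append(prev[path[-1]])
    path.reverse()
    return path


# Toy map: high probability along the top row and the right column.
toy_map = [
    [0.9, 0.9, 0.9],
    [0.1, 0.1, 0.9],
    [0.1, 0.1, 0.9],
]
print(decode_path(toy_map, (0, 0), (2, 2)))
```

On the toy map the decoded path follows the high-probability cells along the top row and down the right column. The point of the sketch is only that a single dense prediction can be decoded once, rather than queried step by step; the decoding used by NavOne itself may differ.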
