GA-VLN: Geometry-Aware BEV Representation for Efficient Vision-Language Navigation
Jiahao Yang, Zihan Wang, Xiangyang Li, Xing Zhu, Yujun Shen, Yinghao Xu, Shuqiang Jiang

TL;DR
GA-VLN introduces a geometry-aware, compact BEV representation for vision-language navigation, significantly improving efficiency and spatial reasoning by integrating explicit depth cues and learned 3D priors.
Contribution
The paper proposes a novel BEV-based spatial representation that combines explicit geometric cues and implicit 3D priors, enhancing VLN performance and efficiency.
Findings
Achieves state-of-the-art results without DAgger or VQA training.
Reduces token redundancy while preserving spatial information.
Improves navigation robustness and data efficiency.
Abstract
Despite significant progress in Vision-Language Navigation (VLN), existing approaches still rely on dense RGB videos that produce excessive patch tokens and lack explicit spatial structure, resulting in substantial computational overhead and limited spatial reasoning. To address these issues, we introduce the Geometry-Aware BEV (GA-BEV), a compact, 3D-grounded feature representation that integrates both explicit and implicit geometric cues into multimodal large language model (MLLM)-based navigation systems. We construct BEV spatial maps from RGB-D inputs by projecting visual features into 3D space and aggregating them into an agent-centric layout that preserves geometric consistency while reducing token redundancy. To further enrich geometric understanding, we incorporate features from a pretrained 3D foundation model into the BEV space, injecting structural priors learned from…
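The projection-and-aggregation step described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names `unproject_depth` and `pool_to_bev`, the pinhole intrinsics, the grid size, the cell size, and the choice of mean pooling are all assumptions made for illustration, since the abstract does not specify them.

```python
import numpy as np


def unproject_depth(depth, fx, fy, cx, cy):
    """Back-project a depth map (H, W) to camera-frame points (H, W, 3).

    Assumes a pinhole camera with x right, y down, z forward.
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return np.stack([x, y, depth], axis=-1)


def pool_to_bev(points, feats, grid_size=16, cell_m=0.5):
    """Average per-pixel features into an agent-centric BEV grid.

    points: (N, 3) camera-frame coordinates; feats: (N, C).
    The agent sits at the middle of the grid's first row and looks along +z.
    Returns the (grid_size, grid_size, C) BEV map and per-cell point counts.
    Grid size, cell size, and mean pooling are illustrative assumptions.
    """
    half_width = grid_size * cell_m / 2.0
    col = np.floor((points[:, 0] + half_width) / cell_m).astype(int)
    row = np.floor(points[:, 2] / cell_m).astype(int)
    valid = (
        (points[:, 2] > 0)  # drop missing or invalid depth readings
        & (col >= 0) & (col < grid_size)
        & (row >= 0) & (row < grid_size)
    )
    flat = row[valid] * grid_size + col[valid]

    channels = feats.shape[1]
    sums = np.zeros((grid_size * grid_size, channels))
    np.add.at(sums, flat, feats[valid])  # scatter-add features per cell
    counts = np.bincount(flat, minlength=grid_size * grid_size)
    bev = sums / np.maximum(counts, 1)[:, None]  # empty cells stay zero
    return bev.reshape(grid_size, grid_size, channels), counts.reshape(grid_size, grid_size)
```

Under these assumptions the token count depends on the grid resolution rather than on the number of frames or image patches: every pixel feature that lands in the same cell collapses into one entry, which is the sense in which a BEV layout can reduce redundancy while keeping metric structure.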
