Exploring Bottlenecks in VLM-LLM Navigation: How 3D Scene Understanding Capability Impacts Zero-Shot VLN
Ziyi Xia, Chaoran Xiong, Litao Wei, Xinhao Hu, and Ling Pei

TL;DR
This paper investigates how the quality of 3D scene understanding affects zero-shot vision-and-language navigation, revealing that beyond a certain point, better perception accuracy offers limited gains in navigation success.
Contribution
The paper quantifies the impact of 3D perception capability on VLN performance and proposes statistical success rate (SR) upper bounds based on perception quality, highlighting a perception saturation phenomenon.
Findings
Improvements in perception accuracy beyond a threshold yield diminishing returns.
The proposed statistical success rate (SR) upper bounds confirm that perception quality affects navigation performance.
Navigation success depends more on core vocabularies and bounding-box accuracy than on pixel-level precision.
Abstract
Zero-shot vision-and-language navigation (VLN) has gained significant attention due to its minimal data collection costs and inherent generalization. This paradigm is typically driven by the integration of pre-trained Vision-Language Models (VLMs) and Large Language Models (LLMs), where VLMs construct 3D scene graphs while LLMs handle high-level reasoning and decision-making. However, a critical bottleneck exists in this system: current 3D perception models prioritize pixel-level accuracy, directly conflicting with the strict computational limits and real-time efficiency demanded by embodied navigation. To address this gap, this paper quantifies the actual impact of 3D scene understanding capability on VLN performance. Based on typical VLM-LLM frameworks, we propose statistical success rate (SR) upper bounds for two core subsystems: 1) the slow LLM planner, which relies on topological…
