P2DNav: Panorama-to-Downview Reasoning for Zero-shot Vision-and-Language Navigation
Kai Sheng, Liuyi Wang, Haojie Dai, Jinlong Li, Yongrui Qin, Zongtao He, Chengju Liu, Qijun Chen

TL;DR
P2DNav introduces a hierarchical zero-shot VLN framework that decomposes navigation into panoramic direction selection and local grounding, enhancing stability and performance in unseen environments.
Contribution
The paper proposes P2DNav, a novel hierarchical approach with explicit decomposition and memory mechanisms, improving zero-shot VLN performance over existing methods.
Findings
Achieves significant success rate (SR) gains on the R2R-CE benchmark
Outperforms state-of-the-art zero-shot waypoint-based methods
Demonstrates effectiveness of hierarchical decision-making and memory modules
Abstract
Vision-and-language navigation (VLN) requires an embodied agent to ground natural-language instructions into executable navigation actions in unseen environments. Existing zero-shot methods typically rely on additional waypoint prediction modules, which often entangle high-level directional reasoning with fine-grained local grounding, leading to error-prone and unstable decisions. In this paper, we propose P2DNav, a hierarchical framework for zero-shot vision-and-language navigation. P2DNav consists of three core components: Panorama-to-Downview (P2D), Sliding-Window Dialogue Memory (SDM), and Reflective Reorientation Mechanism (RRM). P2D explicitly decomposes navigation decision-making into two stages: panoramic direction selection and downview local grounding. It first selects the instruction-relevant direction from a 360° panorama, and then predicts a pixel-level target point…
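The two-stage decomposition described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the names `SlidingWindowMemory`, `select_direction`, and `two_stage_step`, the `score_view` and `ground_point` callables, and the fixed-size window are all assumptions made for illustration. The sketch only shows the control flow of choosing a direction first, grounding a pixel target second, and keeping a bounded history of recent turns.

```python
from collections import deque
from typing import Callable, Deque, List, Sequence, Tuple

Pixel = Tuple[int, int]


class SlidingWindowMemory:
    """Hypothetical bounded memory: keeps only the most recent `window` turns."""

    def __init__(self, window: int) -> None:
        # deque(maxlen=...) silently drops the oldest entry once it is full.
        self.turns: Deque[str] = deque(maxlen=window)

    def add(self, turn: str) -> None:
        self.turns.append(turn)

    def context(self) -> List[str]:
        return list(self.turns)


def select_direction(scores: Sequence[float]) -> int:
    """Stage 1: return the index of the highest-scoring panoramic view."""
    return max(range(len(scores)), key=lambda i: scores[i])


def two_stage_step(
    instruction: str,
    pano_views: Sequence[str],
    score_view: Callable[[str, str, List[str]], float],
    ground_point: Callable[[str, str, List[str]], Pixel],
    memory: SlidingWindowMemory,
) -> Tuple[int, Pixel]:
    """One decision step: pick a direction, then ground a pixel target in it.

    `score_view` and `ground_point` are placeholders for whatever model
    performs each stage; their signatures are assumptions of this sketch.
    """
    context = memory.context()
    # Stage 1: score every view of the panorama against the instruction.
    scores = [score_view(instruction, view, context) for view in pano_views]
    direction = select_direction(scores)
    # Stage 2: local grounding restricted to the chosen direction.
    target = ground_point(instruction, pano_views[direction], context)
    # Record the decision so later steps can condition on recent history.
    memory.add(f"direction={direction}, target={target}")
    return direction, target
```

Separating the two callables keeps the coarse directional choice independent of the fine-grained pixel prediction, which is the decoupling the abstract contrasts with waypoint-based methods.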
