P2DNav: Panorama-to-Downview Reasoning for Zero-shot Vision-and-Language Navigation

Kai Sheng; Liuyi Wang; Haojie Dai; Jinlong Li; Yongrui Qin; Zongtao He; Chengju Liu; Qijun Chen

arXiv:2605.19634 · cs.CV · May 20, 2026

TL;DR

P2DNav introduces a hierarchical zero-shot VLN framework that decomposes navigation into panoramic direction selection and local grounding, enhancing stability and performance in unseen environments.

Contribution

The paper proposes P2DNav, a novel hierarchical approach with explicit decomposition and memory mechanisms, improving zero-shot VLN performance over existing methods.

Findings

01. Achieves significant success rate (SR) gains on the R2R-CE benchmark

02. Outperforms state-of-the-art zero-shot waypoint-based methods

03. Demonstrates the effectiveness of the hierarchical decision-making and memory modules

Abstract

Vision-and-language navigation (VLN) requires an embodied agent to ground natural-language instructions into executable navigation actions in unseen environments. Existing zero-shot methods typically rely on additional waypoint prediction modules, which often entangle high-level directional reasoning with fine-grained local grounding, leading to error-prone and unstable decisions. In this paper, we propose P2DNav, a hierarchical framework for zero-shot vision-and-language navigation. P2DNav consists of three core components: Panorama-to-Downview (P2D), Sliding-Window Dialogue Memory (SDM), and Reflective Reorientation Mechanism (RRM). P2D explicitly decomposes navigation decision-making into two stages: panoramic direction selection and downview local grounding. It first selects the instruction-relevant direction from a 360° panorama, and then predicts a pixel-level target point…
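
The two-stage decomposition described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the names `SlidingWindowMemory`, `p2d_step`, `select_direction`, and `ground_in_downview` are hypothetical, the two model calls are passed in as plain callables, and the memory is assumed to be a fixed-length window of recent turns, which the name SDM suggests but the abstract does not specify. RRM is not modeled here.

```python
from collections import deque
from typing import Callable, Deque, List, Sequence, Tuple

Pixel = Tuple[int, int]  # (x, y) target point in the downview image


class SlidingWindowMemory:
    """Assumed form of a sliding-window memory: keep only the last N turns."""

    def __init__(self, max_turns: int) -> None:
        # deque with maxlen drops the oldest turn automatically when full
        self._turns: Deque[str] = deque(maxlen=max_turns)

    def add(self, turn: str) -> None:
        self._turns.append(turn)

    def context(self) -> List[str]:
        return list(self._turns)


def p2d_step(
    instruction: str,
    panorama_views: Sequence[object],
    select_direction: Callable[[str, Sequence[object], List[str]], int],
    ground_in_downview: Callable[[str, int, List[str]], Pixel],
    memory: SlidingWindowMemory,
) -> Tuple[int, Pixel]:
    """One hypothetical decision step: pick a direction, then a pixel target."""
    # Stage 1: choose the instruction-relevant direction from the panorama.
    heading = select_direction(instruction, panorama_views, memory.context())
    # Stage 2: ground a pixel-level target point for the chosen direction.
    pixel = ground_in_downview(instruction, heading, memory.context())
    # Record the decision so later steps can condition on recent history.
    memory.add(f"heading={heading}, pixel={pixel}")
    return heading, pixel
```

Keeping the two stages as separate callables mirrors the stated goal of not entangling directional reasoning with local grounding: each stage can be prompted, tested, or replaced on its own.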
