DAgger Diffusion Navigation: DAgger Boosted Diffusion Policy for Vision-Language Navigation

Haoxiang Shi; Xiang Deng; Zaijing Li; Gongwei Chen; Yaowei Wang; Liqiang Nie

arXiv:2508.09444·cs.RO·August 14, 2025

TL;DR

This paper introduces DAgger Diffusion Navigation (DifNav), an end-to-end diffusion-based policy for vision-language navigation that outperforms traditional two-stage waypoint methods by modeling multi-modal actions directly in continuous space.

Contribution

The paper proposes a unified diffusion policy for VLN-CE that eliminates the need for waypoint prediction and incorporates DAgger training for robustness and error recovery.
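For readers unfamiliar with DAgger (Dataset Aggregation), the following is a minimal sketch of the generic loop on a toy 1D problem, not the paper's training code. The expert (`expert_action`), the linear learner, and the dynamics are all hypothetical stand-ins chosen so the example runs with the standard library only. The key idea it illustrates is that states are collected under the learner's own policy, labeled by the expert, and added to a growing dataset before refitting.

```python
import random


def expert_action(x):
    # Hypothetical expert: step halfway toward the goal at x = 0.
    return -0.5 * x


def fit_linear_policy(dataset):
    # Least-squares fit of a = w * x on the aggregated (state, action) pairs.
    num = sum(x * a for x, a in dataset)
    den = sum(x * x for x, _ in dataset)
    return num / den if den > 0 else 0.0


def rollout(w, x0, horizon):
    # Run the learner's own policy and record the states it visits.
    states, x = [], x0
    for _ in range(horizon):
        states.append(x)
        x = x + w * x  # toy dynamics: the action is added to the position
    return states


def dagger(iterations=5, horizon=10, seed=0):
    rng = random.Random(seed)
    dataset = []
    w = 0.0  # untrained learner: it does not move at all
    for _ in range(iterations):
        x0 = rng.uniform(1.0, 5.0)
        # 1) Visit states under the current learner, not the expert.
        visited = rollout(w, x0, horizon)
        # 2) Ask the expert what it would have done in each visited state.
        dataset.extend((x, expert_action(x)) for x in visited)
        # 3) Refit the learner on everything collected so far.
        w = fit_linear_policy(dataset)
    return w, len(dataset)


w, n = dagger()
print(f"learned gain w = {w:.3f} from {n} aggregated samples")
```

Because the learner's mistakes determine which states get labeled, the dataset covers the situations the learner actually reaches, which is the property usually credited for DAgger's error-recovery behavior.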

Findings

1. Outperforms previous state-of-the-art models on benchmark datasets.
2. Eliminates reliance on waypoint predictors, simplifying the navigation pipeline.
3. Enhances robustness and long-horizon spatial reasoning in navigation tasks.

Abstract

Vision-Language Navigation in Continuous Environments (VLN-CE) requires agents to follow natural language instructions through free-form 3D spaces. Existing VLN-CE approaches typically use a two-stage waypoint planning framework, where a high-level waypoint predictor generates the navigable waypoints, and then a navigation planner suggests the intermediate goals in the high-level action space. However, this two-stage decomposition framework suffers from: (1) global sub-optimization due to the proxy objective in each stage, and (2) a performance bottleneck caused by the strong reliance on the quality of the first-stage predicted waypoints. To address these limitations, we propose DAgger Diffusion Navigation (DifNav), an end-to-end optimized VLN-CE policy that unifies the traditional two stages, i.e. waypoint generation and planning, into a single diffusion policy. Notably, DifNav employs…
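To make the idea of a diffusion policy that outputs continuous actions more concrete, here is a minimal sketch of DDPM-style ancestral sampling for a 2D action vector. It is a generic illustration and not DifNav's architecture. The function `toy_noise_predictor` is a hypothetical stand-in for a learned network conditioned on observations and the instruction; it is given the target action directly so that the example is runnable and checkable. The linear beta schedule and step count are likewise assumptions.

```python
import math
import random


def make_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    # Linear beta schedule and the cumulative products of alpha = 1 - beta.
    betas = [beta_start + (beta_end - beta_start) * t / (num_steps - 1)
             for t in range(num_steps)]
    alphas = [1.0 - b for b in betas]
    alpha_bars, prod = [], 1.0
    for a in alphas:
        prod *= a
        alpha_bars.append(prod)
    return betas, alphas, alpha_bars


def toy_noise_predictor(x_t, t, alpha_bars, target):
    # ASSUMPTION: an oracle that knows the clean action `target`. A real
    # policy would predict this noise from observations and the instruction.
    s = math.sqrt(alpha_bars[t])
    n = math.sqrt(1.0 - alpha_bars[t])
    return [(x - s * g) / n for x, g in zip(x_t, target)]


def sample_action(target, num_steps=50, seed=0):
    rng = random.Random(seed)
    betas, alphas, alpha_bars = make_schedule(num_steps)
    # Start from pure Gaussian noise in action space.
    x = [rng.gauss(0.0, 1.0) for _ in target]
    for t in reversed(range(num_steps)):
        eps = toy_noise_predictor(x, t, alpha_bars, target)
        coef = betas[t] / math.sqrt(1.0 - alpha_bars[t])
        # Posterior mean of the DDPM reverse step.
        mean = [(xi - coef * ei) / math.sqrt(alphas[t])
                for xi, ei in zip(x, eps)]
        if t > 0:
            # Add noise on every step except the last one.
            sigma = math.sqrt(betas[t])
            x = [m + sigma * rng.gauss(0.0, 1.0) for m in mean]
        else:
            x = mean
    return x


action = sample_action(target=[0.25, -0.1])
print("sampled action:", [round(a, 4) for a in action])
```

With a learned noise predictor in place of the oracle, different random seeds can yield different plausible actions for the same input, which is how a diffusion policy can represent a multi-modal action distribution.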
