Progress-Think: Semantic Progress Reasoning for Vision-Language Navigation
Shuo Wang, Yucheng Wang, Guoxin Lian, Yongcai Wang, Maiyue Chen, Kaihui Wang, Bo Zhang, Zhizhong Su, Yutian Zhou, Wanting Li, Deying Li, Zhaoxin Fan

TL;DR
Progress-Think introduces semantic progress reasoning for vision-language navigation: the agent infers, in instruction-style language, how far it has advanced through a multi-step instruction. A three-stage training framework builds on this signal to improve navigation accuracy and efficiency.
Contribution
The paper proposes a semantic progress reasoning approach, trained with a three-stage framework, that improves navigation performance without requiring expensive annotations.
Findings
Achieves state-of-the-art success rates on R2R-CE and RxR-CE datasets.
Demonstrates improved navigation consistency and efficiency.
Introduces a novel differentiable alignment between visual history and instruction prefixes for progress pretraining.
Abstract
Vision-Language Navigation requires agents to act coherently over long horizons by understanding not only local visual context but also how far they have advanced within a multi-step instruction. However, recent Vision-Language-Action models focus on direct action prediction and earlier progress methods predict numeric achievements; both overlook the monotonic co-progression property of the observation and instruction sequences. Building on this insight, Progress-Think introduces semantic progress reasoning, predicting instruction-style progress from visual observations to enable more accurate navigation. To achieve this without expensive annotations, we propose a three-stage framework. In the initial stage, Self-Aligned Progress Pretraining bootstraps a reasoning module via a novel differentiable alignment between visual history and instruction prefixes. Then, Progress-Guided Policy…
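The abstract does not say how the differentiable alignment between visual history and instruction prefixes is computed. The sketch below is a purely illustrative, hypothetical take on the idea and is not the authors' method. It mean-pools the visual history, builds one embedding per instruction prefix as the cumulative mean of sub-instruction embeddings, scores each prefix by cosine similarity, and takes a temperature softmax over prefix lengths to get a soft progress estimate. A hinge penalty on any drop in predicted progress over time is one way to encode the monotonic co-progression property. The function names (`prefix_progress`, `monotonic_penalty`), the pooling choices, and the temperature `tau` are all assumptions. Every step is a smooth operation, so the same computation would be differentiable in an autodiff framework. NumPy is used here only to show the forward pass.

```python
import numpy as np


def softmax(x, tau=1.0):
    # Numerically stable softmax with temperature tau.
    z = (x - np.max(x)) / tau
    e = np.exp(z)
    return e / e.sum()


def prefix_progress(visual_hist, sub_instr, tau=0.1):
    """Hypothetical soft estimate of how many sub-instructions are completed.

    visual_hist: (T, D) embeddings of the observations seen so far (assumed).
    sub_instr:   (K, D) embeddings of the K sub-instructions (assumed).
    Returns a distribution over prefix lengths 1..K and the expected
    normalized progress, which lies in [1/K, 1].
    """
    K = sub_instr.shape[0]
    # Mean-pool the visual history into a single unit vector.
    hist = visual_hist.mean(axis=0)
    hist = hist / np.linalg.norm(hist)
    # Cumulative mean of sub-instructions gives one embedding per prefix.
    prefixes = np.cumsum(sub_instr, axis=0) / np.arange(1, K + 1)[:, None]
    prefixes = prefixes / np.linalg.norm(prefixes, axis=1, keepdims=True)
    # Cosine similarity between the pooled history and each prefix.
    scores = prefixes @ hist
    p = softmax(scores, tau)
    expected = float(p @ (np.arange(1, K + 1) / K))
    return p, expected


def monotonic_penalty(progress_seq):
    # Sum of all decreases in predicted progress over time; zero when the
    # sequence is non-decreasing (the assumed monotonic co-progression).
    diffs = np.diff(np.asarray(progress_seq, dtype=float))
    return float(np.sum(np.maximum(0.0, -diffs)))


if __name__ == "__main__":
    # Toy case: three orthogonal sub-instructions; the history covers the first two.
    sub = np.eye(3)
    hist = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
    p, prog = prefix_progress(hist, sub)
    print(p.round(3), round(prog, 3))
    print(monotonic_penalty([0.1, 0.3, 0.2, 0.5]))
```

In the toy case the two-step prefix matches the pooled history exactly, so it receives most of the probability mass. The penalty is nonzero only because progress drops from 0.3 to 0.2.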
