ProgressVLA: Progress-Guided Diffusion Policy for Vision-Language Robotic Manipulation

Hongyu Yan; Qiwei Li; Jiaolong Yang; Yadong Mu

arXiv:2603.27670·cs.RO·March 31, 2026

TL;DR

ProgressVLA introduces a novel progress estimation and guidance framework for vision-language robotic manipulation, significantly improving success and generalization in long-horizon tasks.

Contribution

The paper presents a robust progress estimator trained on large-scale datasets and a differentiable progress guidance method using an inverse dynamics world model.

Findings

01 Achieves low prediction residual of 0.07 in simulation

02 Demonstrates zero-shot generalization to real-world samples

03 Improves success rates and generalization on benchmarks and real robots

Abstract

Most existing vision-language-action (VLA) models for robotic manipulation lack progress awareness, typically relying on hand-crafted heuristics for task termination. This limitation is particularly severe in long-horizon tasks involving cascaded sub-goals. In this work, we investigate the estimation and integration of task progress, proposing a novel model named ProgressVLA. Our technical contributions are twofold: (1) robust progress estimation: We pre-train a progress estimator on large-scale, unsupervised video-text robotic datasets. This estimator achieves a low prediction residual (0.07 on a scale of [0, 1]) in simulation and demonstrates zero-shot generalization to unseen real-world samples, and (2) differentiable progress guidance: We introduce an inverse dynamics world model that maps predicted action tokens into future latent visual states. These latents are…
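To make the quantities in the abstract concrete, here is a minimal sketch of how a progress target on [0, 1], a prediction residual, and a progress-based stopping rule might be computed. Everything in it is an assumption for illustration and is not taken from the paper: the linear-in-time labeling, the reading of "prediction residual" as a mean absolute error, the function names, and the threshold and patience values are all hypothetical.

```python
def linear_progress_labels(num_frames):
    """Assumed labeling: frame t of a T-frame episode gets progress t / (T - 1)."""
    if num_frames < 2:
        raise ValueError("need at least two frames")
    return [t / (num_frames - 1) for t in range(num_frames)]


def mean_abs_residual(predicted, target):
    """Assumed residual: mean absolute gap between predicted and target progress."""
    if len(predicted) != len(target):
        raise ValueError("length mismatch")
    return sum(abs(p - q) for p, q in zip(predicted, target)) / len(target)


def should_terminate(progress_history, threshold=0.95, patience=3):
    """Hypothetical stopping rule: the last `patience` estimates all reach `threshold`."""
    recent = progress_history[-patience:]
    return len(recent) == patience and all(p >= threshold for p in recent)


labels = linear_progress_labels(5)       # evenly spaced targets from 0.0 to 1.0
noisy = [0.05, 0.20, 0.55, 0.80, 0.90]   # made-up estimator outputs, not real data
residual = mean_abs_residual(noisy, labels)
```

A rule like `should_terminate` stands in for the hand-crafted termination heuristics the abstract mentions; requiring several consecutive high estimates is one simple way to avoid stopping on a single noisy prediction.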
