ProgressVLA: Progress-Guided Diffusion Policy for Vision-Language Robotic Manipulation
Hongyu Yan, Qiwei Li, Jiaolong Yang, Yadong Mu

TL;DR
ProgressVLA introduces a novel progress estimation and guidance framework for vision-language robotic manipulation, significantly improving success and generalization in long-horizon tasks.
Contribution
The paper presents a robust progress estimator trained on large-scale datasets and a differentiable progress guidance method using an inverse dynamics world model.
Findings
Achieves a low prediction residual of 0.07 in simulation
Demonstrates zero-shot generalization to unseen real-world samples
Improves success rates and generalization on benchmarks and real robots
Abstract
Most existing vision-language-action (VLA) models for robotic manipulation lack progress awareness, typically relying on hand-crafted heuristics for task termination. This limitation is particularly severe in long-horizon tasks involving cascaded sub-goals. In this work, we investigate the estimation and integration of task progress, proposing a novel model named ProgressVLA. Our technical contributions are twofold: (1) robust progress estimation: we pre-train a progress estimator on large-scale, unsupervised video-text robotic datasets. This estimator achieves a low prediction residual (0.07 on a scale of ) in simulation and demonstrates zero-shot generalization to unseen real-world samples; and (2) differentiable progress guidance: we introduce an inverse dynamics world model that maps predicted action tokens into future latent visual states. These latents are…
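The abstract is cut off before it says how the predicted latents are used, so the following is only a toy sketch of what differentiable progress guidance could look like, not the paper's implementation. It assumes the predicted latent is scored by the progress estimator and that the gradient of that score is used to nudge the action. The linear `world_model`, the logistic `progress` function, and every name and parameter here are hypothetical stand-ins chosen so the gradient can be written by hand.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def world_model(z, a, B):
    # Hypothetical linear stand-in for a learned world model:
    # next latent = current latent + B @ action.
    return [z_i + sum(B[i][j] * a[j] for j in range(len(a)))
            for i, z_i in enumerate(z)]


def progress(z, w):
    # Hypothetical progress estimator: a scalar in (0, 1) computed from a latent.
    return sigmoid(sum(w_i * z_i for w_i, z_i in zip(w, z)))


def guide_action(z, a, B, w, step=0.5, iters=20):
    # Gradient ascent on the action so the predicted next latent scores
    # higher estimated progress. This is an assumed use of the latents.
    a = list(a)
    for _ in range(iters):
        p = progress(world_model(z, a, B), w)
        # Chain rule for the toy models: dp/da_j = p * (1 - p) * sum_i w_i * B[i][j]
        grad = [p * (1.0 - p) * sum(w[i] * B[i][j] for i in range(len(w)))
                for j in range(len(a))]
        a = [a_j + step * g_j for a_j, g_j in zip(a, grad)]
    return a


if __name__ == "__main__":
    z0 = [0.0, 0.0]                 # current latent state
    a0 = [0.0, 0.0]                 # action proposed by the policy
    B = [[1.0, 0.0], [0.0, 1.0]]    # toy dynamics matrix
    w = [1.0, -1.0]                 # toy progress-estimator weights
    a1 = guide_action(z0, a0, B, w)
    print("progress before:", progress(world_model(z0, a0, B), w))
    print("progress after: ", progress(world_model(z0, a1, B), w))
```

In a learned system both models would be neural networks and the gradient would come from automatic differentiation; the closed-form gradient above only holds for these toy choices.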
