ProcVLM: Learning Procedure-Grounded Progress Rewards for Robotic Manipulation
Youhe Feng, Hansen Shi, Haoyang Li, Xinlei Guo, Yang Wang, Chengyang Zhang, Jinkai Zhang, Xiaohan Zhang, Jie Tang, Jing Zhang

TL;DR
ProcVLM is a vision-language model that learns dense, procedure-grounded progress rewards for robotic manipulation, improving task understanding and policy optimization.
Contribution
It introduces a procedure-grounded progress estimation method that grounds progress in procedural structure and intra-stage visual change and applies a reasoning-before-estimation paradigm, trained on a large-scale annotated dataset.
Findings
ProcVLM achieves superior procedural reasoning in experiments.
It provides more discriminative progress estimates than baseline models.
ProcVLM enhances reward-guided policy learning in robotic manipulation.
Abstract
Long-horizon robotic manipulation requires dense feedback that reflects how a task advances through its procedural stages, not merely whether the final outcome is successful. Existing reward models often rely on trajectory-level success labels or time-based interpolation, which can conflate elapsed time with true task progress and therefore fail to capture unfinished steps, stagnation, and failure states. We present ProcVLM, a progress-aware vision-language model that learns procedure-grounded progress as a dense reward signal for manipulation. Rather than deriving progress from terminal outcomes or temporal proxies, ProcVLM grounds progress estimation in procedural structure and intra-stage visual change, and further adopts a reasoning-before-estimation paradigm that infers the remaining atomic actions before estimating task progress. Specifically, we construct this supervision by…
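To make the contrast in the abstract concrete, here is a minimal sketch of how a time-interpolated progress label differs from a stage-grounded one on a trajectory that stalls partway through. It is an illustration only, not the authors' labeling pipeline: the function names, the per-frame stage indices, the intra-stage completion fractions, and the equal weighting of stages are all assumptions made for this example.

```python
from typing import List


def time_interpolated_progress(num_frames: int) -> List[float]:
    """Temporal proxy: progress rises linearly with elapsed time."""
    if num_frames < 2:
        return [1.0] * num_frames
    return [t / (num_frames - 1) for t in range(num_frames)]


def stage_grounded_progress(
    stage_ids: List[int], intra_stage: List[float], num_stages: int
) -> List[float]:
    """Hypothetical stage-grounded label.

    progress = (stages already completed + fraction of current stage done)
               / total number of stages
    Assumes every stage carries equal weight (an assumption of this sketch).
    """
    if len(stage_ids) != len(intra_stage):
        raise ValueError("stage_ids and intra_stage must have equal length")
    return [(s + f) / num_stages for s, f in zip(stage_ids, intra_stage)]


# Toy 6-frame trajectory with 2 stages; the robot stalls during
# frames 1-3 (no visual change inside stage 0), then recovers.
stage_ids = [0, 0, 0, 0, 1, 1]
intra_stage = [0.0, 0.5, 0.5, 0.5, 0.0, 1.0]

by_time = time_interpolated_progress(len(stage_ids))
by_stage = stage_grounded_progress(stage_ids, intra_stage, num_stages=2)

print(by_time)   # keeps rising during the stall
print(by_stage)  # [0.0, 0.25, 0.25, 0.25, 0.5, 1.0]
```

In this toy case the time-based label keeps increasing through the stalled frames, while the stage-grounded label stays flat until the scene actually changes, which is the failure mode of temporal proxies that the abstract describes.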
