Learn Where Outcomes Diverge: Efficient VLA RL via Probabilistic Chunk Masking
Vaidehi Bagaria, Nikshep Grampurohit, Pulkit Verma

TL;DR
This paper introduces Probabilistic Chunk Masking (PCM), a method that cuts gradient computation in vision-language-action (VLA) reinforcement learning by backpropagating only through informative trajectory segments, yielding a reported 2.38x wall-clock speedup over standard GRPO.
Contribution
PCM is a novel modification to GRPO that allocates gradient computation to a subset of trajectory chunks based on success-failure variance, improving efficiency without sacrificing success rates.
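The summary does not say how the success-failure variance is computed or how chunks are sampled, so the following is only one possible reading, not the authors' implementation. It assumes a hypothetical per-chunk scalar statistic for each rollout (`chunk_stats`), scores each chunk by the squared gap between the success and failure means, and converts the scores into keep-probabilities whose average stays at or below a hypothetical `budget`. All names and the scoring rule are illustrative assumptions.

```python
import numpy as np


def chunk_keep_probs(chunk_stats, success, budget=0.2, eps=1e-8):
    """Illustrative keep-probabilities per chunk (NOT the paper's exact rule).

    chunk_stats: (G, C) array, one hypothetical scalar per rollout and chunk.
    success:     (G,) boolean array, True where the rollout succeeded.
    budget:      assumed target fraction of chunks to backpropagate through.
    """
    chunk_stats = np.asarray(chunk_stats, dtype=float)
    success = np.asarray(success, dtype=bool)
    num_chunks = chunk_stats.shape[1]

    # Without both outcomes in the group there is no contrast to measure,
    # so fall back to a uniform keep-probability.
    if success.all() or (~success).all():
        return np.full(num_chunks, budget)

    # Assumed score: squared gap between success and failure means per chunk.
    gap = chunk_stats[success].mean(axis=0) - chunk_stats[~success].mean(axis=0)
    score = gap ** 2
    if score.sum() < eps:
        return np.full(num_chunks, budget)

    # Scale so the mean probability equals `budget` before clipping;
    # clipping to [0, 1] can only lower the mean.
    probs = budget * num_chunks * score / score.sum()
    return np.clip(probs, 0.0, 1.0)


def sample_chunk_mask(probs, rng):
    """Draw an independent Bernoulli keep/skip decision for each chunk."""
    return rng.random(probs.shape[0]) < probs


# Toy group: 4 rollouts, 5 chunks; outcomes differ only at chunk index 2.
stats = np.zeros((4, 5))
stats[:2, 2] = 1.0
probs = chunk_keep_probs(stats, success=[True, True, False, False])
mask = sample_chunk_mask(probs, np.random.default_rng(0))
```

In this toy group all probability mass lands on the one chunk where successful and failed rollouts differ, so the other chunks would be skipped in the backward pass.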
Findings
PCM achieves 2.38x wall-clock speedup over standard GRPO.
With PCM, gradients are backpropagated through fewer than 20% of trajectory chunks.
PCM reduces peak activation memory by 60% while maintaining success rates.
Abstract
Reinforcement learning (RL) allows vision-language-action (VLA) policies to generalize beyond their training distribution by optimizing directly for task success, but post-training is computationally expensive. A natural response has been to speed rollout collection through faster simulators and world models. In GRPO-based VLA RL, we find that the dominant cost lies elsewhere: gradient computation accounts for approximately 78% of wall-clock time per step in our runs, while rollout collection accounts for only 21%. Gradient cost dominates because much of this computation is spent on phases that contribute little to learning. GRPO's learning signal is driven by advantage variance: only phases where successful and failed rollouts diverge produce learning signal. However, GRPO assigns the same advantage to every chunk in a rollout. As a result, actor-update compute is spent uniformly…
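The uniform advantage assignment described in the abstract can be sketched as follows. This uses the standard group-normalized GRPO advantage (reward minus the group mean, divided by the group standard deviation) with binary success rewards as an assumption. The function names and the `eps` stabilizer are illustrative, not taken from the paper.

```python
import numpy as np


def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantage: one scalar per rollout in the group."""
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + eps)


def broadcast_to_chunks(advantages, num_chunks):
    """Copy each rollout's advantage onto every one of its chunks."""
    return np.repeat(advantages[:, None], num_chunks, axis=1)


# A mixed group (two successes, two failures) yields nonzero advantages,
# but every chunk of a given rollout receives the same value.
mixed = broadcast_to_chunks(grpo_advantages([1, 0, 1, 0]), num_chunks=3)

# A group where every rollout succeeds has zero reward variance,
# so all advantages (and hence all gradients) vanish.
all_success = broadcast_to_chunks(grpo_advantages([1, 1, 1, 1]), num_chunks=3)
```

Because each row of `mixed` is constant, the advantage by itself cannot tell the update which chunks of a rollout mattered, which is why compute ends up spread evenly across them.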
