TL;DR
Flash-GRPO introduces a single-step training method for video diffusion models that significantly improves efficiency and stability, achieving state-of-the-art alignment with human preferences at reduced computational costs.
Contribution
The paper proposes Flash-GRPO, a single-step policy optimization framework that addresses the instability of existing timestep-subsampling methods and improves training efficiency for large-scale video diffusion models.
Findings
Outperforms full-trajectory training in alignment quality under low computational budgets.
Reduces training time substantially while maintaining stability.
Validates effectiveness on models up to 14B parameters.
Abstract
Group Relative Policy Optimization (GRPO) has emerged as essential for aligning video diffusion models with human preferences, but it faces a critical computational bottleneck: training a 14B-parameter model typically demands hundreds of GPU days per experiment. Existing efficiency methods reduce costs by subsampling training timesteps with a sliding window, but they fundamentally compromise optimization, exhibiting severe instability and failing to reach full-trajectory performance. We present Flash-GRPO, a single-step training framework that outperforms full-trajectory training in alignment quality under low computational budgets while substantially improving training efficiency. Flash-GRPO addresses two critical challenges: iso-temporal grouping eliminates timestep-confounded variance by enforcing prompt-wise temporal consistency, decoupling policy performance from timestep difficulty; temporal…
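The iso-temporal grouping idea in the abstract can be illustrated with a minimal sketch. It assumes that "prompt-wise temporal consistency" means every sample generated for the same prompt is trained at one shared timestep, so rewards within a group are compared at equal timestep difficulty. The helper names, group size, and reward values below are hypothetical and are not taken from the paper; the advantage formula is the standard group-relative normalization used by GRPO.

```python
import random
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Standard GRPO-style normalization within one group:
    (reward - group mean) / (group std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def assign_iso_temporal_timesteps(prompts, group_size, num_timesteps, rng):
    """Hypothetical helper (an assumption, not the paper's code):
    draw ONE training timestep per prompt and share it across all
    samples in that prompt's group."""
    assignment = {}
    for prompt in prompts:
        t = rng.randrange(num_timesteps)  # single timestep for the whole group
        assignment[prompt] = [t] * group_size
    return assignment


rng = random.Random(0)
prompts = ["a cat surfing", "a city at night"]  # illustrative prompts
timesteps = assign_iso_temporal_timesteps(
    prompts, group_size=4, num_timesteps=50, rng=rng
)

# Illustrative rewards for the four samples of one prompt.
advantages = group_relative_advantages([0.2, 0.5, 0.9, 0.4])
```

Because each group shares a timestep under this assumption, differences in reward within the group reflect the sampled outputs rather than which timestep happened to be drawn for each sample.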
