Stepwise Credit Assignment for GRPO on Flow-Matching Models
Yash Savani, Branislav Kveton, Yuchen Liu, Yilin Wang, Jing Shi, Subhojyoti Mukherjee, Nikos Vlassis, and Krishna Kumar Singh

TL;DR
This paper introduces Stepwise-Flow-GRPO, a reinforcement learning method for flow models that credits each generation step by its own reward improvement instead of spreading the final-image reward uniformly across steps, improving sample efficiency and convergence speed.
Contribution
The paper proposes stepwise credit assignment for Flow-GRPO: Tweedie's formula provides intermediate reward estimates, and gain-based advantages credit each step according to the reward improvement it produces.
Findings
Converges faster than Flow-GRPO with uniform credit assignment.
Improves sample efficiency over uniform credit assignment.
Introduces a DDIM-inspired SDE that improves reward quality while preserving stochasticity.
Abstract
Flow-GRPO successfully applies reinforcement learning to flow models, but uses uniform credit assignment across all steps. This ignores the temporal structure of diffusion generation: early steps determine composition and content (low-frequency structure), while late steps resolve details and textures (high-frequency details). Moreover, assigning uniform credit based solely on the final image can inadvertently reward suboptimal intermediate steps, especially when errors are corrected later in the diffusion trajectory. We propose Stepwise-Flow-GRPO, which assigns credit based on each step's reward improvement. By leveraging Tweedie's formula to obtain intermediate reward estimates and introducing gain-based advantages, our method achieves superior sample efficiency and faster convergence. We also introduce a DDIM-inspired SDE that improves reward quality while preserving stochasticity…
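The idea of gain-based credit can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the rectified-flow convention x_t = (1 - t) x_0 + t eps with velocity v = eps - x_0, and the per-step normalization of gains across the group are all assumptions made for illustration. The sketch forms a one-step clean-sample estimate from a noisy state, then compares uniform advantages (from the final reward only) with advantages built from per-step reward gains.

```python
import numpy as np

def denoised_estimate(x_t, v_t, t):
    """One-step estimate of the clean sample from a noisy state.

    Assumes the rectified-flow convention x_t = (1 - t) * x_0 + t * eps
    with velocity v = eps - x_0, so x_0 = x_t - t * v (hypothetical helper).
    """
    return x_t - t * v_t

def uniform_advantages(final_rewards, num_steps, eps=1e-8):
    """Group-normalized final reward, copied to every step (uniform credit)."""
    r = np.asarray(final_rewards, dtype=float)
    adv = (r - r.mean()) / (r.std() + eps)
    return np.repeat(adv[:, None], num_steps, axis=1)

def stepwise_gain_advantages(step_rewards, eps=1e-8):
    """Advantages from per-step reward gains.

    step_rewards has shape (G, T + 1): the reward of the denoised estimate
    at each of the T + 1 states along G trajectories in a group.
    Gains are normalized across the group separately at each step
    (an assumed normalization, chosen by analogy with GRPO).
    """
    r = np.asarray(step_rewards, dtype=float)
    gains = r[:, 1:] - r[:, :-1]              # reward improvement per step
    mean = gains.mean(axis=0, keepdims=True)
    std = gains.std(axis=0, keepdims=True)
    return (gains - mean) / (std + eps)

# Two trajectories that reach the same final reward by different routes.
step_rewards = np.array([[0.0, 1.0, 1.0],    # improves at the first step
                         [0.0, 0.0, 1.0]])   # improves at the second step
uniform = uniform_advantages(step_rewards[:, -1], num_steps=2)
stepwise = stepwise_gain_advantages(step_rewards)
```

In this toy group both trajectories end with the same reward, so the uniform advantages are all zero and carry no learning signal, while the gain-based advantages still distinguish which step produced the improvement in each trajectory.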
