TL;DR
TempFlow-GRPO introduces a temporally-aware reinforcement learning framework for flow models, improving human preference alignment and image generation quality by capturing the importance of different generation stages.
Contribution
It presents a novel GRPO framework that leverages the temporal structure of flow models with three key innovations for better credit assignment and exploration.
Findings
Achieves state-of-the-art human preference alignment.
Improves text-to-image generation benchmarks.
Enhances exploration efficiency during training.
Abstract
Recent flow matching models for text-to-image generation have achieved remarkable quality, yet their integration with reinforcement learning for human preference alignment remains suboptimal, hindering fine-grained reward-based optimization. We observe that the key impediment to effective GRPO training of flow models is the temporal uniformity assumption in existing approaches: sparse terminal rewards with uniform credit assignment fail to capture the varying criticality of decisions across generation timesteps, resulting in inefficient exploration and suboptimal convergence. To remedy this shortcoming, we introduce TempFlow-GRPO (Temporal Flow GRPO), a principled GRPO framework that captures and exploits the temporal structure inherent in flow-based generation. TempFlow-GRPO introduces three key innovations: (i) a trajectory branching mechanism that provides process rewards by…
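The abstract is cut off before it finishes describing trajectory branching, so the following is only a minimal sketch of the general idea as the reviews describe it: sample deterministically (ODE) everywhere except one chosen step, inject noise at that step, and attribute the terminal reward to it. It is not the authors' implementation. The toy velocity field, the simplified noisy step (plain Euler plus Gaussian noise, without any drift correction), the reward function, and all names (`velocity`, `ode_step`, `sde_step`, `branch_rewards`, `group_advantages`) are hypothetical. The group normalization is the standard GRPO-style advantage.

```python
import numpy as np

def velocity(x, t):
    # Hypothetical stand-in for a learned flow-model velocity field.
    return -x

def ode_step(x, t, dt):
    # Deterministic Euler step of dx/dt = v(x, t).
    return x + velocity(x, t) * dt

def sde_step(x, t, dt, sigma, rng):
    # Simplified noisy step (assumption): Euler step plus Gaussian noise.
    noise = rng.standard_normal(x.shape)
    return ode_step(x, t, dt) + sigma * np.sqrt(dt) * noise

def branch_rewards(x0, k, num_steps, group_size, sigma, reward_fn, rng):
    dt = 1.0 / num_steps
    x = x0
    # Shared deterministic prefix up to the branching step k.
    for i in range(k):
        x = ode_step(x, i * dt, dt)
    rewards = np.empty(group_size)
    for g in range(group_size):
        # Branch: a single noisy step at index k, fresh noise per member.
        xb = sde_step(x, k * dt, dt, sigma, rng)
        # Deterministic completion, so reward differences trace back to step k.
        for i in range(k + 1, num_steps):
            xb = ode_step(xb, i * dt, dt)
        rewards[g] = reward_fn(xb)
    return rewards

def group_advantages(rewards, eps=1e-8):
    # GRPO-style advantage: normalize rewards within the group.
    return (rewards - rewards.mean()) / (rewards.std() + eps)

rng = np.random.default_rng(0)
x0 = rng.standard_normal(4)
reward_fn = lambda x: -float(np.sum(x ** 2))  # toy reward (assumption)
rewards = branch_rewards(x0, k=3, num_steps=10, group_size=8,
                         sigma=0.5, reward_fn=reward_fn, rng=rng)
adv = group_advantages(rewards)
```

Because every branch shares the same prefix and the same deterministic completion, the spread of `rewards` within the group reflects only the noise injected at step `k`, which is what makes the terminal reward usable as a per-step signal in this sketch.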
Peer Reviews
Decision: ICLR 2026 Poster
For reinforcement learning tasks, dense rewards are crucial for effective credit assignment. The proposed Trajectory Branching mechanism provides an elegant and effective way to obtain dense rewards along the denoising trajectory. The introduced reweighting mechanism offers a valuable analysis of how gradients evolve across steps in baseline algorithms and presents a solution to mitigate the identified issues.
The proposed method involves numerous ODE denoising steps, which substantially increase computational overhead. However, the paper lacks a comparison against the baseline method using training time as the horizontal axis to illustrate efficiency trade-offs. The authors should evaluate the performance of the reweighting mechanism under different $\sigma_t$ schedulers rather than relying solely on the one used in Flow-GRPO, to examine how the choice of scheduler influences its effectiveness. It r
1. The authors astutely identify that the Flow-GRPO algorithm treats all timesteps equally, and tackle this issue via single-timestep SDE optimization.
2. The noise reweighting method is shown to be effective through both solid theoretical analysis and experimental results.
3. The paper is generally well written with a clear logical structure.
1. The contribution of the seed group strategy is relatively small compared to other parts of the work, and the paper should provide additional details on it.
2. Similarly, MixGRPO [1] proposes a training window of SDE timesteps that also tackles the issue of treating all timesteps equally. However, there is limited discussion comparing with MixGRPO.
3. The paper does not discuss the phenomenon of reward hacking, which is an inevitable problem for the GRPO method.
[1] Mixgrpo: Unlockin
- The paper is overall well-written and easy to follow.
- The motivation and the proposed method are clear and straightforward: the method addresses the temporal inhomogeneity and credit assignment problems through intermediate resampling for intermediate value estimation and noise-aware reweighting.
- The proposed method shows strong empirical performance in both efficiency and end-level performance, with comparisons that include GPU time.
- Theorem 1 is intuitively reasonable, but labeling it as a Theorem feels overstated since the underlying assumptions and proof sketch are insufficiently formalized. The analytical depth is also somewhat limited.
- The explanation around line 847 (regarding why the average number of branches is 4.5× when K = 10) is unclear. It is not obvious how this factor arises or how the branching schedule operates, and the paper does not explicitly describe it.
- Adding more algorithmic details or pseudocod
