Principled RL for Flow Matching Emerges from the Chunk-level Policy Optimization
Yifu Luo, Haoyuan Sun, Xinhao Hu, Penghui Du, Keyu Fan, Bo Li, Sinan Du, Xu Wan, Zhiyu Chen, Bo Xia, Tiantian Zhang, Yongzhe Chang, Changqian Yu, Kun Gai, Xueqian Wang

TL;DR
This paper introduces GCPO, a chunk-level reinforcement learning method for flow matching in text-to-image generation, improving performance over previous step-level approaches by mitigating advantage attribution issues.
Contribution
It proposes the first chunk-level policy optimization approach for post-training flow matching, reporting up to 43% relative gains over GRPO on standard T2I benchmarks and preference alignment.
Findings
GCPO achieves up to 43% relative gains over GRPO.
Extensive experiments validate superior performance on T2I benchmarks.
Chunk-level optimization effectively mitigates advantage attribution issues.
Abstract
Recent progress in post-training flow matching for text-to-image (T2I) generation with Group Relative Policy Optimization (GRPO) has demonstrated strong potential. However, it is hindered by a critical limitation: inaccurate advantage attribution. In this work, we argue that aggregating consecutive steps into a coherent 'chunk' and shifting the policy optimization paradigm from GRPO's step level to the chunk level can effectively mitigate the negative impact of this issue. Building on this insight, we propose Group Chunking Policy Optimization (GCPO), the first chunk-level reinforcement learning approach for post-training flow matching. Extensive experiments demonstrate that GCPO achieves superior performance on both standard T2I benchmarks and preference alignment, with up to 43% relative gains over GRPO, highlighting the promise of chunk-level policy optimization. The code is…
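To make the step-level versus chunk-level distinction concrete, the following is a minimal sketch and not the authors' implementation. It assumes that a chunk's importance ratio is the product of the per-step ratios inside it (that is, the exponentiated sum of per-step log-ratios), and that a single group-normalized advantage per sample is shared by all of its chunks. The function names, the `boundaries` argument, and the clip value are hypothetical choices made for illustration; GCPO's actual objective and weighting may differ.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: normalize rewards within one group of samples."""
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def chunk_log_ratios(logp_new, logp_old, boundaries):
    """Sum per-step log-ratios inside each chunk.

    logp_new, logp_old: arrays of shape (G, T), per-step log-probabilities
        for G samples and T denoising steps.
    boundaries: increasing chunk start indices (hypothetical format), e.g.
        [0, 2, 5] with T=8 gives chunks [0:2], [2:5], [5:8].
    Returns an array of shape (G, number_of_chunks).
    """
    diff = np.asarray(logp_new, dtype=float) - np.asarray(logp_old, dtype=float)
    return np.add.reduceat(diff, boundaries, axis=1)

def clipped_chunk_objective(logp_new, logp_old, rewards, boundaries, clip=0.2):
    """PPO-style clipped surrogate with one ratio per chunk (assumed form)."""
    adv = group_relative_advantages(rewards)[:, None]  # shape (G, 1)
    ratio = np.exp(chunk_log_ratios(logp_new, logp_old, boundaries))
    unclipped = ratio * adv
    clipped = np.clip(ratio, 1.0 - clip, 1.0 + clip) * adv
    return np.minimum(unclipped, clipped).mean()
```

Passing `boundaries=list(range(T))` makes every chunk a single step, which under these assumptions corresponds to the step-level setting the abstract attributes to GRPO.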
Peer Reviews
Decision: Submitted to ICLR 2026
1. The core idea of optimizing at the chunk level instead of at separate timesteps is an interesting and novel direction for image generation. 2. The proposed method is explained with clarity.
1. Proposition 1 and its corresponding proof raise concerns. Specifically, Proposition 1 claims that a smaller chunk size leads to better performance. Since sampling through separate timesteps is equivalent to sampling with a chunk size set to $K=1$, the authors should further clarify the optimal limit for "how small is enough." 2. The proof of Proposition 1 also presents issues. Eq. (35) states that $J_{\text{chunk}} = \frac{1}{T}J_{\text{GRPO}}$, suggesting the objective function for the prop
- The motivation claimed in Figure 1 is very interesting and insightful. Additionally, the findings on temporal dynamics are beneficial to the community. - Chunk boundaries are informed by prompt-invariant temporal dynamics via relative L1 distance, yielding a principled, dynamics-aware segmentation rather than arbitrary chunking.
I thank the authors for their efforts in this work. Below are some concerns about this paper. - This work claims to be the first “chunk‑level” method but does not compare against other GRPO variants that it cites, such as Flow-GRPO and Pref-GRPO, which weakens the contribution boundary beyond a single Dance‑GRPO baseline. Moreover, the proposed method performs only on par with Dance‑GRPO on WISE. - The chunking implementation is heuristic. Boundaries are precomputed from relative L1 latent dynamics and kept f
(1) The authors propose the idea of incorporating both global and local advantages to evaluate the optimality of trajectory sequences, which is an interesting direction worthy of further exploration. (2) The proposed temporal dynamics method avoids complex hyperparameter configurations, thereby enhancing the generality of the approach.
(1) More intuitive results: The example provided in Figure 2 of the paper is merely a schematic illustration. The authors are encouraged to present real image cases demonstrating whether, during the early or middle stages of generation, intermediate images exhibit higher quality, yet the final convergence results in an inferior output. (2) Motivation concern: Although the authors argue that certain steps in the generation process may possess local advantages, I believe that a well-formed generat
