DenseGRPO: From Sparse to Dense Reward for Flow Matching Model Alignment
Haoyou Deng, Keyu Yan, Chaojie Mao, Xiang Wang, Yu Liu, Changxin Gao, Nong Sang

TL;DR
DenseGRPO introduces dense, step-wise rewards for flow matching models in text-to-image generation, improving alignment with human preferences by addressing the sparse reward problem and enhancing training effectiveness.
Contribution
The paper proposes a dense reward prediction framework that estimates the reward gain of each denoising step, together with an adaptive exploration scheme, to improve alignment and training of flow matching models.
Findings
DenseGRPO outperforms previous methods on standard benchmarks.
Dense rewards improve fine-grained feedback and model alignment.
Adaptive exploration enhances training stability and effectiveness.
Abstract
Recent GRPO-based approaches built on flow matching models have shown remarkable improvements in human preference alignment for text-to-image generation. Nevertheless, they still suffer from the sparse reward problem: the terminal reward of the entire denoising trajectory is applied to all intermediate steps, resulting in a mismatch between the global feedback signals and the exact fine-grained contributions at intermediate denoising steps. To address this issue, we introduce DenseGRPO, a novel framework that aligns human preference with dense rewards, which evaluate the fine-grained contribution of each denoising step. Specifically, our approach includes two key components: (1) we propose to predict the step-wise reward gain as the dense reward of each denoising step, which applies a reward model on the intermediate clean images via an ODE-based approach. This manner ensures an…
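The step-wise reward gain described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the scalar latents, the Euler integrator `ode_denoise`, and the stand-ins `toy_velocity` and `toy_reward` are all hypothetical and chosen only so the example runs without external dependencies. It assumes that each intermediate latent is deterministically denoised to a clean estimate, scored by a reward model, and that the dense reward of a step is the difference between consecutive scores.

```python
from typing import Callable, List, Sequence


def ode_denoise(x_t: float, t: float,
                velocity: Callable[[float, float], float],
                n_steps: int = 4) -> float:
    """Integrate dx/dt = v(x, t) from time t down to 0 with Euler steps.

    Toy scalar stand-in for the ODE rollout that maps an intermediate
    latent to a clean-image estimate (assumption, not the paper's solver).
    """
    if t <= 0.0:
        return x_t  # already at the clean end of the trajectory
    dt = t / n_steps
    x, s = x_t, t
    for _ in range(n_steps):
        x = x - dt * velocity(x, s)
        s -= dt
    return x


def dense_rewards(latents: Sequence[float], times: Sequence[float],
                  velocity: Callable[[float, float], float],
                  reward_model: Callable[[float], float]) -> List[float]:
    """Return one reward gain per denoising step.

    Step i moves latents[i] to latents[i + 1]; its reward is the change in
    the reward-model score of the corresponding clean estimates.
    """
    scores = [reward_model(ode_denoise(x, t, velocity))
              for x, t in zip(latents, times)]
    return [scores[i + 1] - scores[i] for i in range(len(scores) - 1)]


def toy_velocity(x: float, t: float) -> float:
    # Hypothetical velocity field for illustration only.
    return x - 2.0


def toy_reward(x: float) -> float:
    # Hypothetical reward: higher when the clean estimate is near 2.0.
    return -abs(x - 2.0)


# One sampled trajectory: noisy latent at t = 1.0 down to the final sample at t = 0.0.
latents = [5.0, 3.9, 2.8, 2.2]
times = [1.0, 0.6, 0.3, 0.0]

rewards = dense_rewards(latents, times, toy_velocity, toy_reward)
sparse_reward = toy_reward(latents[-1])  # terminal reward shared by all steps in the sparse setting
print("dense rewards per step:", rewards)
print("sparse terminal reward:", sparse_reward)
```

Under these assumptions the per-step gains telescope: their sum equals the terminal reward minus the score of the clean estimate obtained from the initial latent, so the dense signal redistributes the same overall feedback across steps instead of copying one terminal value to every step.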
Peer Reviews
Decision: ICLR 2026 Poster
* The paper highlights the importance of **dense rewards** for effective **credit assignment** in reinforcement learning tasks. This is indeed a crucial factor that can significantly improve policy optimization stability and efficiency.
* Figure 6(c) provides informative results. It clearly demonstrates that the most straightforward approach of predicting the final $x_0$ in a single step to compute delta rewards performs even worse than the original Flow-GRPO, underscoring the necessity of prope
* The proposed DenseGRPO algorithm introduces many additional ODE denoising steps, which inevitably increase computation time. However, the paper lacks a comparison with Flow-GRPO using training time as the shared horizontal axis to show efficiency differences.
* Since the modified algorithm may alter KL-consumption behavior, it would be helpful to visualize how the KL loss evolves during training for both Flow-GRPO and DenseGRPO.
* The proposed Exploration Space Calibration module seems desig
- The paper addresses the well-known issue of sparse reward assignment in GRPO-based flow matching models and provides a conceptually simple fix.
- DenseGRPO demonstrates noticeable improvements across several text-to-image benchmarks and alignment metrics (e.g., +1.0 PickScore gain).
- Figures and algorithm descriptions are intuitive, especially the visualization of dense reward distributions and the adaptive exploration scheme.
- **Conceptual Incrementality without a Deeper Credit Assignment Analysis:** While the shift from sparse to dense reward seems natural, the paper does not provide a principled analysis of why ODE-based reward estimation captures per-step contribution more faithfully. The method assumes that ODE rollouts preserve semantic consistency, yet no theoretical or empirical verification supports this. Without a formal treatment of credit assignment or causality, the approach reads more as a heuristic refinem
**Originality:** The paper presents a genuinely novel perspective on the sparse reward problem in flow matching model alignment. While dense rewards have been explored in text generation and diffusion models, the specific ODE-based approach for estimating step-wise rewards without additional specialized models is creative and practical. The identification of exploration space mismatch through dense reward analysis is an insightful observation that leads naturally to the second contribution. The
**1. Limited Analysis of Reward Hacking:** While Table 1 shows some reward hacking (e.g., Aesthetic Score degradation in compositional generation), the discussion is superficial. The paper doesn't analyze: (a) why certain metrics degrade while others improve, (b) whether dense rewards exacerbate or mitigate reward hacking compared to sparse rewards, (c) qualitative failure cases where the model exploits the reward function, or (d) potential solutions. Given that reward hacking is a critical conc
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Artificial Intelligence in Games
