DenseGRPO: From Sparse to Dense Reward for Flow Matching Model Alignment

Haoyou Deng; Keyu Yan; Chaojie Mao; Xiang Wang; Yu Liu; Changxin Gao; Nong Sang

arXiv:2601.20218·cs.CV·February 26, 2026

Open Access 3 Reviews

TL;DR

DenseGRPO introduces dense, step-wise rewards for flow matching models in text-to-image generation, improving alignment with human preferences by addressing the sparse reward problem and enhancing training effectiveness.

Contribution

The paper proposes a novel dense reward prediction framework and an adaptive exploration scheme, which together significantly improve alignment and training of flow matching models.

Findings

01

DenseGRPO outperforms previous methods on standard benchmarks.

02

Dense rewards improve fine-grained feedback and model alignment.

03

Adaptive exploration enhances training stability and effectiveness.

Abstract

Recent GRPO-based approaches built on flow matching models have shown remarkable improvements in human preference alignment for text-to-image generation. Nevertheless, they still suffer from the sparse reward problem: the terminal reward of the entire denoising trajectory is applied to all intermediate steps, resulting in a mismatch between the global feedback signals and the exact fine-grained contributions at intermediate denoising steps. To address this issue, we introduce DenseGRPO, a novel framework that aligns human preference with dense rewards, which evaluate the fine-grained contribution of each denoising step. Specifically, our approach includes two key components: (1) we propose to predict the step-wise reward gain as the dense reward of each denoising step, by applying a reward model to the intermediate clean images via an ODE-based approach. This manner ensures an…
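The contrast between the sparse and dense schemes described in the abstract can be illustrated with a small sketch. The abstract is truncated here, so the exact definition of the reward gain is not available; the code below assumes the gain is the difference between reward-model scores of consecutive intermediate clean images. The function names and the example scores are hypothetical, and the ODE-based recovery of clean images and the reward model itself are not modeled.

```python
from typing import List, Sequence


def sparse_rewards(terminal_reward: float, num_steps: int) -> List[float]:
    """Sparse scheme: every denoising step receives the same terminal reward."""
    return [terminal_reward] * num_steps


def dense_rewards(clean_image_scores: Sequence[float]) -> List[float]:
    """Assumed dense scheme: reward gain between consecutive clean-image scores.

    clean_image_scores[i] is the (hypothetical) reward-model score of the clean
    image estimated from the latent after i denoising steps, so the list has
    one more entry than there are steps.
    """
    if len(clean_image_scores) < 2:
        raise ValueError("need scores for at least two latents")
    return [
        after - before
        for before, after in zip(clean_image_scores, clean_image_scores[1:])
    ]


# Hypothetical scores for a 3-step trajectory (initial latent plus 3 steps).
scores = [0.10, 0.35, 0.30, 0.80]

sparse = sparse_rewards(scores[-1], num_steps=len(scores) - 1)
dense = dense_rewards(scores)

print("sparse:", sparse)
print("dense: ", dense)
```

In this toy trajectory the second step lowers the score and so receives a negative dense reward, whereas the sparse scheme credits it with the same terminal reward as every other step. Under the assumed definition the gains also telescope: their sum equals the final score minus the initial one.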

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 6 · Confidence 4

Strengths

* The paper highlights the importance of **dense rewards** for effective **credit assignment** in reinforcement learning tasks. This is indeed a crucial factor that can significantly improve policy optimization stability and efficiency.
* Figure 6(c) provides informative results. It clearly demonstrates that the most straightforward approach of predicting the final $x_0$ in a single step to compute delta rewards performs even worse than the original Flow-GRPO, underscoring the necessity of prope

Weaknesses

* The proposed Dense-GRPO algorithm introduces many additional ODE denoising steps, which inevitably increase computation time. However, the paper lacks a comparison with Flow-GRPO under the same training-time horizontal axis to show efficiency differences.
* Since the modified algorithm may alter KL-consumption behavior, it would be helpful to visualize how the KL loss evolves during training for both Flow-GRPO and Dense-GRPO.
* The proposed Exploration Space Calibration module seems desig

Reviewer 02 · Rating 6 · Confidence 3

Strengths

- The paper addresses the well-known issue of sparse reward assignment in GRPO-based flow matching models and provides a conceptually simple fix.
- DenseGRPO demonstrates noticeable improvements across several text-to-image benchmarks and alignment metrics (e.g., +1.0 PickScore gain).
- Figures and algorithm descriptions are intuitive, especially the visualization of dense reward distributions and the adaptive exploration scheme.

Weaknesses

- Conceptual Incrementality without a Deeper Credit Assignment Analysis: While the shift from sparse to dense reward seems natural, the paper does not provide a principled analysis of why ODE-based reward estimation captures per-step contribution more faithfully. The method assumes that ODE rollouts preserve semantic consistency, yet no theoretical or empirical verification supports this. Without a formal treatment of credit assignment or causality, the approach reads more as a heuristic refinem

Reviewer 03 · Rating 6 · Confidence 3

Strengths

**Originality:** The paper presents a genuinely novel perspective on the sparse reward problem in flow matching model alignment. While dense rewards have been explored in text generation and diffusion models, the specific ODE-based approach for estimating step-wise rewards without additional specialized models is creative and practical. The identification of exploration space mismatch through dense reward analysis is an insightful observation that leads naturally to the second contribution. The

Weaknesses

**1. Limited Analysis of Reward Hacking:** While Table 1 shows some reward hacking (e.g., Aesthetic Score degradation in compositional generation), the discussion is superficial. The paper doesn't analyze: (a) why certain metrics degrade while others improve, (b) whether dense rewards exacerbate or mitigate reward hacking compared to sparse rewards, (c) qualitative failure cases where the model exploits the reward function, or (d) potential solutions. Given that reward hacking is a critical conc

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Artificial Intelligence in Games