TL;DR
GRPO-Guard stabilizes GRPO-based reinforcement learning for flow-matching models by normalizing importance ratios and reweighting gradients, which mitigates over-optimization and improves robustness without heavy regularization.
Contribution
The paper introduces GRPO-Guard, a novel method that normalizes importance ratios and reweights gradients to mitigate implicit over-optimization in flow-matching models.
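As a rough picture of what "normalizing importance ratios" could look like, the sketch below centers and rescales log-ratios separately for each timestep so that every timestep shares a zero log-mean and a common spread. This is an illustrative assumption, not the paper's formulation: the function name `normalize_ratios`, the `scale` parameter, and the toy batch are all hypothetical, and the gradient reweighting step is not shown.

```python
import math
from statistics import fmean, pstdev

def normalize_ratios(log_ratios_by_t, scale=0.1, tiny=1e-8):
    """Hypothetical per-timestep standardization of log importance ratios.

    log_ratios_by_t maps a timestep index to a list of log-ratios.
    Returns ratios whose log has mean 0 and std `scale` at every timestep.
    """
    normalized = {}
    for t, logs in log_ratios_by_t.items():
        mu = fmean(logs)       # per-timestep mean of the log-ratios
        sigma = pstdev(logs)   # per-timestep spread of the log-ratios
        normalized[t] = [
            math.exp(scale * (x - mu) / (sigma + tiny)) for x in logs
        ]
    return normalized

# Toy batch: both timesteps are left-shifted, with very different spreads.
batch = {
    0: [-0.30, -0.10, -0.20],
    1: [-0.03, -0.01, -0.02],
}
balanced = normalize_ratios(batch)
```

After this step the ratios at each timestep are centered on 1 in log space, so a fixed clipping range would treat all timesteps alike under the stated assumptions.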
Findings
Significantly reduces over-optimization in diffusion models.
Maintains or improves image quality and alignment metrics.
Effective across multiple diffusion backbones and tasks.
Abstract
Recently, GRPO-based reinforcement learning has shown remarkable progress in optimizing flow-matching models, effectively improving their alignment with task-specific rewards. Within these frameworks, the policy update relies on importance-ratio clipping to constrain overconfident positive and negative gradients. However, in practice, we observe a systematic shift in the importance-ratio distribution: its mean falls below 1 and its variance differs substantially across timesteps. This left-shifted and inconsistent distribution prevents positive-advantage samples from entering the clipped region, causing the mechanism to fail in constraining overconfident positive updates. As a result, the policy model inevitably enters an implicit over-optimization stage: while the proxy reward continues to increase, essential metrics such as image quality and text-prompt alignment deteriorate sharply,…
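The clipping failure described in the abstract can be illustrated with a small simulation. The sketch below uses the standard PPO-style clipped surrogate and draws synthetic log-normal ratios. The clip range `eps=0.2`, the log-mean and log-std values, and the helper names are assumptions chosen for illustration, not values from the paper.

```python
import math
import random

def clipped_surrogate(ratio, advantage, eps=0.2):
    """PPO/GRPO-style clipped objective for a single sample."""
    clipped_ratio = min(max(ratio, 1.0 - eps), 1.0 + eps)
    return min(ratio * advantage, clipped_ratio * advantage)

def upper_clip_fraction(ratios, eps=0.2):
    """Share of ratios above 1 + eps, the only region where a
    positive-advantage update is actually clipped."""
    return sum(r > 1.0 + eps for r in ratios) / len(ratios)

def sample_ratios(log_mean, log_std, n, rng):
    """Hypothetical log-normal ratios; not the distribution measured in the paper."""
    return [math.exp(rng.gauss(log_mean, log_std)) for _ in range(n)]

rng = random.Random(0)
centered = sample_ratios(0.0, 0.15, 20_000, rng)    # mean ratio near 1
shifted = sample_ratios(-0.15, 0.15, 20_000, rng)   # mean ratio below 1

print("clipped share, centered:", upper_clip_fraction(centered))
print("clipped share, shifted: ", upper_clip_fraction(shifted))
```

With these made-up parameters, the left-shifted batch has far fewer ratios above 1 + eps than the centered one, so most of its positive-advantage samples pass through the objective unclipped.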
