GRPO-Guard: Mitigating Implicit Over-Optimization in Flow Matching via Regulated Clipping

Jing Wang; Jiajun Liang; Jie Liu; Henglin Liu; Gongye Liu; Jun Zheng; Wanyuan Pang; Ao Ma; Zhenyu Xie; Xintao Wang; Meng Wang; Pengfei Wan; Xiaodan Liang

arXiv:2510.22319 · cs.CV · October 31, 2025

TL;DR

GRPO-Guard enhances flow-matching reinforcement learning by stabilizing importance ratios and gradient updates, effectively preventing over-optimization and improving model robustness without heavy regularization.

Contribution

The paper introduces GRPO-Guard, a novel method that normalizes importance ratios and reweights gradients to mitigate implicit over-optimization in flow-matching models.
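To make "normalizes importance ratios" concrete, here is a minimal sketch of one possible reading: standardize the log importance ratios of a sample group at each timestep, so that the resulting ratios are centered at 1 and have the same spread at every timestep. The function name `normalize_log_ratios`, the `target_std` parameter, and the sample values are hypothetical and chosen for illustration; this summary does not give the paper's actual formula, and the gradient reweighting step is not shown.

```python
import math
import statistics


def normalize_log_ratios(log_ratios, target_std=0.1):
    """Center log-ratios at 0 and rescale them to a common spread.

    Hypothetical rule for illustration only, not the paper's formula.
    `target_std` is an assumed hyperparameter.
    """
    mean = statistics.fmean(log_ratios)
    std = statistics.pstdev(log_ratios)
    if std == 0.0:
        # All samples agree: every normalized ratio becomes exp(0) = 1.
        return [0.0 for _ in log_ratios]
    return [(x - mean) / std * target_std for x in log_ratios]


# Made-up log-ratios for two timesteps with different shift and spread.
raw_by_timestep = {
    "early": [-0.30, -0.22, -0.26, -0.18],
    "late": [-0.05, -0.01, -0.04, -0.02],
}

for name, raw in raw_by_timestep.items():
    ratios = [math.exp(x) for x in normalize_log_ratios(raw)]
    print(name, [round(r, 3) for r in ratios])
```

After this step both timesteps produce ratios that straddle 1 with the same spread, so a single clipping range would treat them consistently.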

Findings

01 Significantly reduces over-optimization in diffusion models.

02 Maintains or improves image quality and alignment metrics.

03 Effective across multiple diffusion backbones and tasks.

Abstract

Recently, GRPO-based reinforcement learning has shown remarkable progress in optimizing flow-matching models, effectively improving their alignment with task-specific rewards. Within these frameworks, the policy update relies on importance-ratio clipping to constrain overconfident positive and negative gradients. However, in practice, we observe a systematic shift in the importance-ratio distribution: its mean falls below 1 and its variance differs substantially across timesteps. This left-shifted and inconsistent distribution prevents positive-advantage samples from entering the clipped region, causing the mechanism to fail in constraining overconfident positive updates. As a result, the policy model inevitably enters an implicit over-optimization stage: while the proxy reward continues to increase, essential metrics such as image quality and text-prompt alignment deteriorate sharply,…
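The clipping failure described in the abstract can be illustrated with a small simulation. The sketch below uses the standard PPO-style clipped surrogate and draws importance ratios from a log-normal distribution; the distribution, the clip range `eps=0.2`, and the shift of -0.1 in log space are assumptions chosen for illustration, not values reported by the paper.

```python
import math
import random


def clipped_surrogate(ratio, advantage, eps=0.2):
    """Standard PPO-style clipped objective for a single sample."""
    clipped = max(1.0 - eps, min(ratio, 1.0 + eps))
    return min(ratio * advantage, clipped * advantage)


def is_clipped(ratio, advantage, eps=0.2):
    """True when clipping zeroes the gradient for this sample."""
    if advantage > 0:
        return ratio > 1.0 + eps
    return ratio < 1.0 - eps


def clip_fractions(log_mean, log_std, eps=0.2, n=20000, seed=0):
    """Fraction of positive- and negative-advantage samples that get clipped.

    Ratios are simulated as exp(Normal(log_mean, log_std)); this is an
    assumed toy distribution, not measured data.
    """
    rng = random.Random(seed)
    ratios = [math.exp(rng.gauss(log_mean, log_std)) for _ in range(n)]
    pos = sum(is_clipped(r, +1.0, eps) for r in ratios) / n
    neg = sum(is_clipped(r, -1.0, eps) for r in ratios) / n
    return pos, neg


centered = clip_fractions(log_mean=0.0, log_std=0.1)
shifted = clip_fractions(log_mean=-0.1, log_std=0.1)
print("centered     (pos, neg):", centered)
print("left-shifted (pos, neg):", shifted)
```

With the left-shifted distribution, far fewer positive-advantage samples exceed `1 + eps`, so their updates are rarely constrained, while negative-advantage samples are clipped more often. This matches the asymmetry the abstract attributes to a ratio mean below 1.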

Code & Models

Models

jing1119/GRPO-Guard (Hugging Face model)