OP-GRPO: Efficient Off-Policy GRPO for Flow-Matching Models

Liyu Zhang; Kehan Li; Tingrui Han; Tao Zhao; Yuxuan Sheng; Shibo He; Chao Li

arXiv:2604.04142·cs.CV·April 7, 2026

TL;DR

OP-GRPO introduces an off-policy training framework for flow-matching models that significantly improves training efficiency while maintaining generation quality, by reusing high-quality trajectories and correcting for the resulting distribution shift.

Contribution

OP-GRPO is the first off-policy GRPO framework for flow-matching models. It combines trajectory replay, importance sampling correction, and trajectory truncation to improve training efficiency.
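
The trajectory-replay component could be sketched as a bounded buffer that keeps the best trajectories seen so far. This is a minimal Python sketch under our own assumptions, not the authors' implementation. The admission rule (a trajectory must beat its group's mean reward and then displace the current worst entry), the capacity, and the names `ReplayBuffer`, `Trajectory`, `maybe_add`, and `rewards` are all hypothetical. Each entry also stores its per-step log-probabilities and the policy version that produced it, because an off-policy correction later needs the behavior policy's likelihoods.

```python
import heapq
import itertools
import random
from dataclasses import dataclass, field


@dataclass(order=True)
class Trajectory:
    # Ordering uses (reward, uid); uid only breaks ties between equal rewards.
    reward: float
    uid: int
    prompt: str = field(compare=False)
    log_probs: list = field(compare=False)  # per-step log-probs under the behavior policy
    policy_version: int = field(compare=False)  # which policy snapshot sampled it


class ReplayBuffer:
    """Hypothetical bounded buffer keeping the highest-reward trajectories seen."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self._heap: list = []  # min-heap: the worst stored trajectory is at index 0
        self._ids = itertools.count()

    def maybe_add(self, prompt, reward, log_probs, policy_version, group_mean):
        """Admit a trajectory if it passes the (assumed) quality filter."""
        # Assumed selection rule: keep only trajectories above their group's mean reward.
        if reward <= group_mean:
            return False
        traj = Trajectory(reward, next(self._ids), prompt, log_probs, policy_version)
        if len(self._heap) < self.capacity:
            heapq.heappush(self._heap, traj)
            return True
        if reward > self._heap[0].reward:
            heapq.heapreplace(self._heap, traj)  # evict the current worst entry
            return True
        return False

    def sample(self, k, rng: random.Random):
        """Draw up to k stored trajectories uniformly without replacement."""
        return rng.sample(self._heap, min(k, len(self._heap)))

    def rewards(self):
        """Sorted rewards of everything currently stored."""
        return sorted(t.reward for t in self._heap)


if __name__ == "__main__":
    buf = ReplayBuffer(capacity=2)
    group = [0.2, 0.9, 0.5, 0.7]  # rewards for one prompt's sampled group
    mean = sum(group) / len(group)
    for r in group:
        buf.maybe_add("a red cube", r, [-1.0, -1.2], policy_version=0, group_mean=mean)
    print(buf.rewards())  # [0.7, 0.9]
```

A min-heap keeps eviction cheap: the lowest-reward entry always sits at index 0, so a stronger newcomer replaces it with a single `heapq.heapreplace` call. The stored `log_probs` are what a later off-policy correction would compare against the current policy.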

Findings

01

OP-GRPO achieves performance comparable to or better than Flow-GRPO.

02

Training efficiency improves, with the number of training steps reduced by 65.8%.

03

The method maintains generation quality across image and video benchmarks.

Abstract

Post-training via GRPO has demonstrated remarkable effectiveness in improving the generation quality of flow-matching models. However, GRPO suffers from inherently low sample efficiency due to its on-policy training paradigm. To address this limitation, we present OP-GRPO, the first Off-Policy GRPO framework tailored for flow-matching models. First, we actively select high-quality trajectories and adaptively incorporate them into a replay buffer for reuse in subsequent training iterations. Second, to mitigate the distribution shift introduced by off-policy samples, we propose a sequence-level importance sampling correction that preserves the integrity of GRPO's clipping mechanism while ensuring stable policy updates. Third, we theoretically and empirically show that late denoising steps yield ill-conditioned off-policy ratios, and mitigate this by truncating trajectories at late steps.…
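
The second and third components can be illustrated together. The sketch below is one possible construction under stated assumptions, not the paper's exact objective. It assumes the sequence-level correction is a single importance weight pi_old / mu per trajectory (mu being the behavior policy that generated a buffered sample), computed by exponentiating the summed per-step log-probability differences over the kept steps and capping the result for stability, while the usual per-step clipped ratio pi_theta / pi_old is left unchanged. Truncation is modeled by the hypothetical `keep_steps` argument, which drops every denoising step at or after that index. The names `group_advantages`, `off_policy_surrogate`, `clip_eps`, and `max_weight` are illustrative.

```python
import math


def group_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: standardize rewards within one prompt's group."""
    n = len(rewards)
    mean = sum(rewards) / n
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / n)
    return [(r - mean) / (std + eps) for r in rewards]


def off_policy_surrogate(logp_new, logp_old, logp_behavior, advantage,
                         keep_steps, clip_eps=0.2, max_weight=10.0):
    """Clipped surrogate (to maximize) for one trajectory with a sequence-level IS weight.

    logp_new:      per-step log-probs under the policy being optimized (pi_theta)
    logp_old:      per-step log-probs under the snapshot policy (pi_old)
    logp_behavior: per-step log-probs recorded when the sample was generated (mu)
    Index 0 is the earliest, noisiest denoising step. Steps at index >= keep_steps
    are truncated (hypothetical parameter) and contribute nothing.
    """
    if keep_steps < 1:
        raise ValueError("keep_steps must be at least 1")
    new = logp_new[:keep_steps]
    old = logp_old[:keep_steps]
    beh = logp_behavior[:keep_steps]

    # Assumed sequence-level correction: one weight pi_old / mu for the whole
    # truncated trajectory. Capping the log-weight before exponentiating bounds
    # the weight at max_weight and avoids overflow.
    log_w = sum(o - b for o, b in zip(old, beh))
    weight = math.exp(min(log_w, math.log(max_weight)))

    # Standard per-step PPO/GRPO clipping on pi_theta / pi_old, kept as-is.
    terms = []
    for n_t, o_t in zip(new, old):
        ratio = math.exp(n_t - o_t)
        clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
        terms.append(min(ratio * advantage, clipped * advantage))
    return weight * sum(terms) / len(terms)


if __name__ == "__main__":
    adv = group_advantages([0.9, 0.3, 0.6, 0.2])
    lp = [-1.0, -2.0, -3.0, -4.0]
    # Fresh on-policy sample: mu == pi_old == pi_theta, so weight and ratios are 1.
    print(off_policy_surrogate(lp, lp, lp, adv[0], keep_steps=3))  # adv[0], up to rounding
    # Replayed sample from an older policy; the large late-step mismatch at
    # index 3 is truncated away and never enters the weight.
    stale = [-1.1, -2.2, -2.9, -9.0]
    print(off_policy_surrogate(lp, lp, stale, adv[0], keep_steps=3))
```

For a freshly sampled trajectory the behavior and old log-probabilities coincide, the weight is exactly 1, and the expression reduces to the ordinary clipped GRPO surrogate. A plausible intuition for truncation (our assumption, not necessarily the paper's argument): if each denoising step is a Gaussian transition with standard deviation sigma_t, the log-ratio log p_1(x) - log p_2(x) between two policies with means m_1 and m_2 equals (||x - m_2||^2 - ||x - m_1||^2) / (2 sigma_t^2), so as sigma_t shrinks toward the end of sampling, small mean mismatches produce very large ratios.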
