MixGRPO: Unlocking Flow-based GRPO Efficiency with Mixed ODE-SDE

Junzhe Li; Yutao Cui; Tao Huang; Yinping Ma; Chun Fan; Yiming Cheng; Miles Yang; Zhao Zhong; Liefeng Bo

arXiv:2507.21802·cs.AI·March 23, 2026

TL;DR

MixGRPO introduces a mixed ODE-SDE sampling framework with a sliding window mechanism that improves the efficiency and performance of GRPO training for flow-based image generation models, cutting training time by nearly 50%.

Contribution

It proposes a novel mixed ODE-SDE sampling strategy with a sliding window to enhance efficiency and effectiveness in flow matching models.

Findings

01

MixGRPO outperforms DanceGRPO in effectiveness and efficiency.

02

MixGRPO reduces training time by nearly 50%.

03

MixGRPO-Flash further cuts training time by 71%.

Abstract

Although GRPO substantially enhances flow matching models in human preference alignment of image generation, methods such as FlowGRPO and DanceGRPO still exhibit inefficiency due to the necessity of sampling and optimizing over all denoising steps specified by the Markov Decision Process (MDP). In this paper, we propose MixGRPO, a novel framework that leverages the flexibility of mixed sampling strategies through the integration of stochastic differential equations (SDE) and ordinary differential equations (ODE). This streamlines the optimization process within the MDP to improve efficiency and boost performance. Specifically, MixGRPO introduces a sliding window mechanism, using SDE sampling and GRPO-guided optimization only within the window, while applying ODE sampling outside. This design confines sampling randomness to the time-steps within the window, thereby reducing…
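The sliding window described in the abstract can be pictured with a small scheduling sketch. This is an illustration only, not the authors' implementation: the function names, the default values, and the choice to clamp the window at the last denoising step are assumptions. The parameters correspond to the window size w, shift interval τ, and stride s mentioned in the reviews.

```python
def window_start(iteration, tau, stride, num_steps, window_size):
    """Left edge of the SDE window after `iteration` training iterations.

    Assumption: the window shifts by `stride` steps every `tau` iterations
    and stops once it reaches the final denoising steps.
    """
    shifts = iteration // tau
    return min(shifts * stride, num_steps - window_size)


def step_modes(iteration, num_steps=25, window_size=4, tau=25, stride=1):
    """Label each denoising step as 'sde' (inside the window) or 'ode'.

    Default values are illustrative, not taken from the paper.
    """
    left = window_start(iteration, tau, stride, num_steps, window_size)
    return [
        "sde" if left <= i < left + window_size else "ode"
        for i in range(num_steps)
    ]


if __name__ == "__main__":
    # Early in training the window sits at the first (noisiest) steps.
    print(step_modes(0, num_steps=10, window_size=3, tau=5, stride=2))
    # After 5 iterations it has shifted by one stride of 2 steps.
    print(step_modes(5, num_steps=10, window_size=3, tau=5, stride=2))
```

Under this sketch, only the steps labeled `sde` would carry sampling randomness and receive GRPO updates; all other steps would be sampled deterministically.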

Peer Reviews

Decision·ICLR 2026 Conference Withdrawn Submission

Reviewer 01·Rating 6·Confidence 4

Strengths

For reinforcement learning tasks, improving sampling efficiency is crucial. The authors propose a window-based sampling strategy that accelerates sampling while maintaining diversity.

Weaknesses

In LLMs, it is rare to see algorithms that train using only partial tokens. Why is it acceptable in diffusion models to do so, and even achieve strong performance by optimizing only a frozen window? The paper lacks analysis and discussion on this aspect.

Reviewer 02·Rating 4·Confidence 2

Strengths

1. The paper presents a simple yet clear improvement in training speed over DanceGRPO by introducing a structured sliding-window mechanism and mixed ODE–SDE sampling.
2. By introducing a high-order ODE solver, the authors propose a flash version of MixGRPO, which further accelerates sampling at the cost of generation quality.

Weaknesses

1. The authors remove the KL loss and rely on inference-time hybrid sampling to prevent reward hacking, but this design choice is not sufficiently analyzed. An ablation study would strengthen the claim of stability.
2. The evaluation is inconsistent: each table benchmarks against only one prior method (DanceGRPO in some, FlowGRPO in others) rather than both across all relevant experiments, hindering a comprehensive assessment of MixGRPO's advantages.

Reviewer 03·Rating 4·Confidence 4

Strengths

- The mixed sampling approach is an elegant idea that balances stochastic exploration and deterministic efficiency, representing a meaningful step beyond previous all-SDE GRPO variants.
- The progressive and decay-based movement of the SDE window is intuitively appealing, resembling a temporal discounting mechanism in RL, and empirically effective.
- Results show up to 50–70% faster training while achieving higher or comparable human-preference alignment scores.
- The authors perform ablation

Weaknesses

- The proof in Appendix A only establishes marginal distribution equivalence between SDE and ODE sampling. It does not **guarantee unbiasedness or convergence** of the RL optimization when the window slides dynamically. Thus, the mixed process remains largely heuristic.
- The method introduces multiple coupled hyperparameters ($w, \tau, s$), and optimal values vary across metrics. This raises concerns about reproducibility and robustness across datasets or modalities.
- Although per-iteration t

Reviewer 04·Rating 4·Confidence 3

Strengths

- Technical design: The method is described formally. The authors derive a hybrid ODE–SDE sampling equation and show how it specializes to rectified flows, combining an SDE term within interval S and ODE outside. They then discretize the mixed dynamics using Euler‑Maruyama for the SDE part and Euler for the ODE part. A sliding window W(l) of fixed size w moves along the denoising steps to schedule which timesteps are optimized. Ablations examine different movement strategies (frozen, random, pro
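The discretization this reviewer describes (Euler-Maruyama inside the SDE interval, plain Euler outside) can be sketched on a scalar toy problem. This is a hedged illustration rather than the paper's sampler: `mixed_sample`, `velocity`, `sde_drift`, and `sigma` are hypothetical names, and the exact SDE drift for rectified flows (which includes a score-based correction) is abstracted into the caller-supplied `sde_drift`.

```python
import math
import random


def mixed_sample(x, velocity, sde_drift, sigma, timesteps, sde_steps, rng):
    """Integrate x along `timesteps`, mixing two update rules.

    Step indices in `sde_steps` use an Euler-Maruyama update (drift plus
    Gaussian noise); all other steps use a deterministic Euler update.
    Returns the final state and the list of steps that injected noise.
    """
    noisy_steps = []
    for i in range(len(timesteps) - 1):
        t = timesteps[i]
        dt = timesteps[i + 1] - t  # negative when denoising from t=1 to t=0
        if i in sde_steps:
            noise = rng.gauss(0.0, 1.0)
            x = x + sde_drift(x, t) * dt + sigma(t) * math.sqrt(abs(dt)) * noise
            noisy_steps.append(i)
        else:
            x = x + velocity(x, t) * dt
    return x, noisy_steps


if __name__ == "__main__":
    ts = [1.0, 0.75, 0.5, 0.25, 0.0]
    toy_velocity = lambda x, t: -x  # toy field, not a trained model
    x_final, noisy = mixed_sample(
        1.0, toy_velocity, toy_velocity, lambda t: 0.5, ts, {1, 2},
        random.Random(0),
    )
    print(x_final, noisy)
```

With an empty `sde_steps` set the trajectory is fully deterministic, which mirrors the claim that randomness is confined to the steps inside the window.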

Weaknesses

- Heuristic nature and hyperparameter sensitivity: MixGRPO relies heavily on several hyperparameters: the size of the SDE window w, the shift interval τ, the stride s, the movement strategy (frozen, random, progressive), and decay schedule parameters. Ablations show that performance varies substantially with these settings, indicating sensitivity. Yet the paper offers little guidance on analyzing robustness across datasets or models.
- Evaluation scope: Experiments focus on one flow‑matching ba

Code & Models

Models

🤗 tulvgengenr/MixGRPO·model·44 dl·♡ 8
