TL;DR
MixGRPO introduces a mixed ODE-SDE sampling framework with a sliding window mechanism to improve efficiency and performance in flow-based image generation models, significantly reducing training time.
Contribution
It proposes a novel mixed ODE-SDE sampling strategy with a sliding window to enhance efficiency and effectiveness in flow matching models.
Findings
MixGRPO outperforms DanceGRPO in effectiveness and efficiency.
MixGRPO reduces training time by nearly 50%.
MixGRPO-Flash further cuts training time by 71%.
Abstract
Although GRPO substantially enhances flow matching models in human preference alignment of image generation, methods such as FlowGRPO and DanceGRPO still exhibit inefficiency due to the necessity of sampling and optimizing over all denoising steps specified by the Markov Decision Process (MDP). In this paper, we propose MixGRPO, a novel framework that leverages the flexibility of mixed sampling strategies through the integration of stochastic differential equations (SDE) and ordinary differential equations (ODE). This streamlines the optimization process within the MDP to improve efficiency and boost performance. Specifically, MixGRPO introduces a sliding window mechanism, using SDE sampling and GRPO-guided optimization only within the window, while applying ODE sampling outside. This design confines sampling randomness to the time-steps within the window, thereby reducing…
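The control flow described in the abstract (stochastic steps inside the window, deterministic steps outside) can be sketched as follows. This is a minimal toy illustration, not the paper's sampler: `mixed_sample`, `toy_velocity`, and the constant noise scale `sigma` are hypothetical names introduced here, and the drift correction that a real SDE sampler would need to keep its marginals matched to the ODE is deliberately left out.

```python
import numpy as np


def mixed_sample(x, timesteps, velocity, window, sigma, rng):
    """Integrate along `timesteps`, adding noise only at step indices in `window`.

    Outside the window each step is a plain Euler (ODE) update.
    Inside the window the same drift is used and Gaussian noise is added,
    i.e. an Euler-Maruyama step for dx = v dt + sigma dW (simplified drift).
    """
    x = np.asarray(x, dtype=float)
    sde_steps = []  # indices that were sampled stochastically
    for i in range(len(timesteps) - 1):
        dt = timesteps[i + 1] - timesteps[i]
        x = x + velocity(x, timesteps[i]) * dt  # deterministic drift update
        if i in window:
            # Noise term scaled by sqrt(|dt|), as in Euler-Maruyama.
            x = x + sigma * np.sqrt(abs(dt)) * rng.standard_normal(x.shape)
            sde_steps.append(i)
    return x, sde_steps


def toy_velocity(x, t):
    """Stand-in for a learned velocity field (assumption, not a real model)."""
    return -x


timesteps = np.linspace(1.0, 0.0, 9)  # 8 denoising steps from t=1 to t=0
x_init = np.ones(4)
sample, stochastic_steps = mixed_sample(
    x_init, timesteps, toy_velocity, range(2, 4), 0.5, np.random.default_rng(0)
)
```

Only the steps recorded in `stochastic_steps` would contribute to a policy-gradient update in this picture; the remaining steps are deterministic given their input and therefore need no log-probability term.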
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
For reinforcement learning tasks, improving sampling efficiency is crucial. The authors propose a window-based sampling strategy that accelerates sampling while maintaining diversity.
In LLMs, it is rare to see algorithms that train using only partial tokens. Why is it acceptable in diffusion models to do so, and even achieve strong performance by optimizing only a frozen window? The paper lacks analysis and discussion on this aspect.
1. The paper presents a simple yet clear improvement in training speed over DanceGRPO by introducing a structured sliding-window mechanism and mixed ODE–SDE sampling.
2. By introducing a high-order ODE solver, the authors propose a flash version of MixGRPO, which further accelerates sampling but compromises generation quality.
1. The authors remove the KL loss and rely on inference-time hybrid sampling to prevent reward hacking, but this design choice is not sufficiently analyzed. An ablation study would strengthen the claim of stability.
2. The evaluation is inconsistent in its baselines: it benchmarks against only one prior method at a time (DanceGRPO in some tables, FlowGRPO in others) rather than both across all relevant experiments, which hinders a comprehensive assessment of MixGRPO's advantages.
- The mixed sampling approach is an elegant idea that balances stochastic exploration and deterministic efficiency, representing a meaningful step beyond previous all-SDE GRPO variants.
- The progressive and decay-based movement of the SDE window is intuitively appealing, resembling a temporal discounting mechanism in RL, and empirically effective.
- Results show up to 50–70% faster training while achieving higher or comparable human-preference alignment scores.
- The authors perform ablation
- The proof in Appendix A only establishes marginal distribution equivalence between SDE and ODE sampling. It does not **guarantee unbiasedness or convergence** of the RL optimization when the window slides dynamically. Thus, the mixed process remains largely heuristic.
- The method introduces multiple coupled hyperparameters ($w, \tau, s$), and optimal values vary across metrics. This raises concerns about reproducibility and robustness across datasets or modalities.
- Although per-iteration t
- Technical design: The method is described formally. The authors derive a hybrid ODE–SDE sampling equation and show how it specializes to rectified flows, combining an SDE term within interval S and ODE outside. They then discretize the mixed dynamics using Euler‑Maruyama for the SDE part and Euler for the ODE part. A sliding window W(l) of fixed size w moves along the denoising steps to schedule which timesteps are optimized. Ablations examine different movement strategies (frozen, random, pro
- Heuristic nature and hyperparameter sensitivity: MixGRPO relies heavily on several hyperparameters: the size of the SDE window w, the shift interval τ, the stride s, the movement strategy (frozen, random, progressive), and decay schedule parameters. Ablations show that performance varies substantially with these settings, indicating sensitivity. Yet the paper offers little guidance on analyzing robustness across datasets or models.
- Evaluation scope: Experiments focus on one flow‑matching ba
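To make the hyperparameters discussed in the reviews concrete, here is a hedged sketch of how a window of size w, shift interval τ, and stride s might determine which denoising steps are optimized at a given training iteration. The function names and the exact semantics (for example, that "frozen" pins the window at step 0 and that the progressive window stops at the last valid position) are assumptions made for illustration; the decay schedule mentioned by the reviewers is not modeled.

```python
import random


def window_start(iteration, num_steps, w, tau, s, strategy="progressive", rng=None):
    """Return the first denoising-step index covered by the window.

    iteration: current training iteration (0-based)
    num_steps: total number of denoising steps
    w: window size, tau: iterations between shifts, s: stride per shift
    """
    last = num_steps - w  # last valid start so the window stays in range
    if strategy == "frozen":
        return 0  # assumed: window never moves from the initial steps
    if strategy == "random":
        rng = rng or random.Random()
        return rng.randint(0, last)  # uniform over valid start positions
    if strategy == "progressive":
        # Shift by `s` steps every `tau` iterations, clamped at the end.
        return min((iteration // tau) * s, last)
    raise ValueError(f"unknown strategy: {strategy}")


def window(iteration, num_steps, w, tau, s, strategy="progressive", rng=None):
    """Return the step indices inside the window for this iteration."""
    start = window_start(iteration, num_steps, w, tau, s, strategy, rng)
    return range(start, start + w)


# Example: 25 denoising steps, window of 4 steps, shifted by 1 every 25 iterations.
early = list(window(0, num_steps=25, w=4, tau=25, s=1))
later = list(window(25, num_steps=25, w=4, tau=25, s=1))
```

Even in this simplified form, the three values interact: a larger s or smaller τ moves the window through the trajectory faster, while w sets how many steps per sample receive gradient updates, which is one way to read the reviewers' concern about coupled settings.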
