TL;DR
V-GRPO is a stable, efficient ELBO-based reinforcement learning method for denoising generative models. It reports state-of-the-art text-to-image synthesis results, with a 2x speedup over MixGRPO and a 3x speedup over DiffusionNFT.
Contribution
The paper demonstrates that ELBO-based RL can be both stable and efficient, surpassing MDP-based methods in denoising generative models.
Findings
V-GRPO achieves state-of-the-art results in text-to-image synthesis.
It delivers a 2x speedup over MixGRPO.
It delivers a 3x speedup over DiffusionNFT.
Abstract
Aligning denoising generative models with human preferences or verifiable rewards remains a key challenge. While policy-gradient online reinforcement learning (RL) offers a principled post-training framework, its direct application is hindered by the intractable likelihoods of these models. Prior work therefore either optimizes an induced Markov decision process (MDP) over sampling trajectories, which is stable but inefficient, or uses likelihood surrogates based on the diffusion evidence lower bound (ELBO), which have so far underperformed on visual generation. Our key insight is that the ELBO-based approach can, in fact, be made both stable and efficient. By reducing surrogate variance and controlling gradient steps, we show that this approach can beat MDP-based methods. To this end, we introduce Variational GRPO (V-GRPO), a method that integrates ELBO-based surrogates with the Group…
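To make the two ingredients named in the abstract concrete, the following is a minimal NumPy sketch, not the paper's implementation. It assumes the standard group-relative advantage (rewards normalized within a group of samples for the same prompt) and uses the simplified noise-prediction loss as a stand-in for the diffusion ELBO. The function names, the `predict_noise` callable, and the `alpha_bar` schedule are hypothetical, and the sketch omits whatever variance reduction and gradient-step control V-GRPO actually applies.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of samples (GRPO-style)."""
    rewards = np.asarray(rewards, dtype=np.float64)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def elbo_surrogate(x0, predict_noise, timesteps, alpha_bar, rng):
    """Monte Carlo estimate of an ELBO-style log-likelihood surrogate.

    Assumption: the negative simplified denoising loss (mean squared
    noise-prediction error) stands in for the ELBO. `predict_noise` is a
    hypothetical callable (x_t, t) -> predicted noise.
    """
    losses = []
    for t in timesteps:
        noise = rng.standard_normal(x0.shape)
        # Forward-diffuse the clean sample to noise level t.
        x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * noise
        losses.append(np.mean((predict_noise(x_t, t) - noise) ** 2))
    return -float(np.mean(losses))

def grpo_objective(surrogates, rewards):
    """Advantage-weighted surrogate, to be maximized over model parameters."""
    adv = group_relative_advantages(rewards)
    return float(np.mean(adv * np.asarray(surrogates, dtype=np.float64)))

# Toy usage with a dummy predictor that always outputs zeros.
rng = np.random.default_rng(0)
alpha_bar = np.linspace(0.99, 0.01, 10)  # hypothetical noise schedule
samples = [rng.standard_normal((4, 4)) for _ in range(3)]
surrogates = [
    elbo_surrogate(x, lambda x_t, t: np.zeros_like(x_t), [0, 5, 9], alpha_bar, rng)
    for x in samples
]
objective = grpo_objective(surrogates, rewards=[0.2, 0.5, 0.9])
```

In a real training loop the surrogate would be differentiated with respect to the model parameters inside `predict_noise`, so that samples with above-average reward have their surrogate likelihood pushed up and below-average samples pushed down.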
