Reinforce Adjoint Matching: Scaling RL Post-Training of Diffusion and Flow-Matching Models
Andreas Bergmeister, Stefanie Jegelka, Nikolas Nüsken, Carles Domingo-Enrich, Jakiw Pidstrigach

TL;DR
This paper introduces Reinforce Adjoint Matching (RAM), a scalable reinforcement learning post-training method for diffusion and flow-matching models that improves reward alignment in image generation without SDE rollouts or reward gradients.
Contribution
The paper shows that the supervised-regression structure of pretraining extends to RL post-training, and derives RAM, a simple, scalable consistency loss that aligns the model with a reward.
Findings
RAM achieves the highest reward on composability, text rendering, and human preference.
It reaches Flow-GRPO's peak reward in up to 50 times fewer training steps.
No SDE rollouts or reward gradients are required.
Abstract
Diffusion and flow-matching models scale because pretraining is supervised regression: a clean sample is noised analytically, and a model regresses against a closed-form target. RL post-training aligns the model with a reward. In image generation, this makes samples compose objects correctly, render text legibly, and match human preferences. Existing methods rely on costly SDE rollouts, reward gradients, or surrogate losses, sacrificing pretraining's regression structure. We show that the structure extends to RL post-training. Under KL-regularized reward maximization, the optimal generative process tilts the clean-endpoint distribution towards samples with higher reward and leaves the noising law unchanged. Combining this with the adjoint-matching optimality condition and a REINFORCE identity, we derive Reinforce Adjoint Matching (RAM): a consistency loss that corrects the pretraining…
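The two ingredients the abstract names, analytic noising with a closed-form regression target and a reward tilt of the clean-endpoint distribution, can be illustrated with a minimal NumPy sketch. This is not the RAM loss, which is not spelled out in the text above. It assumes a linear interpolation path between Gaussian noise and data (one common flow-matching convention, not necessarily the paper's), and it assumes the standard KL-regularized optimum in which the tilted distribution is proportional to the base distribution times exp(r / beta). The function names and the temperature `beta` are hypothetical, chosen for illustration.

```python
import numpy as np


def noised_sample_and_target(x1, rng, t=None):
    """Noise a clean batch analytically and return the regression target.

    Assumption: linear path x_t = (1 - t) * x0 + t * x1 with x0 ~ N(0, I),
    whose closed-form velocity target is x1 - x0.
    """
    x0 = rng.standard_normal(x1.shape)            # pure-noise endpoint
    if t is None:
        t = rng.uniform(size=(x1.shape[0], 1))    # one time per sample
    x_t = (1.0 - t) * x0 + t * x1                 # analytic noising
    target = x1 - x0                              # closed-form target
    return t, x_t, target


def regression_loss(pred, target):
    """Mean squared error between predicted and target velocities."""
    return float(np.mean(np.sum((pred - target) ** 2, axis=1)))


def tilted_weights(rewards, beta):
    """Self-normalized weights proportional to exp(reward / beta).

    Assumption: this is the exponential tilt of a finite set of clean
    samples under KL-regularized reward maximization with strength beta.
    """
    logits = np.asarray(rewards, dtype=float) / beta
    logits = logits - logits.max()                # numerical stability
    w = np.exp(logits)
    return w / w.sum()


rng = np.random.default_rng(0)
x1 = rng.standard_normal((4, 3))                  # stand-in for clean data
t, x_t, target = noised_sample_and_target(x1, rng)
weights = tilted_weights([0.0, 1.0, 2.0], beta=1.0)
```

In this sketch a smaller `beta` concentrates the weights on the highest-reward samples, while a large `beta` leaves them close to uniform, which corresponds to staying near the pretrained model.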
