TL;DR
This paper introduces SDPO, a reinforcement learning framework designed to improve the alignment of few-step diffusion models with specific objectives by using dense reward signals and novel optimization strategies.
Contribution
We propose a new RL framework, SDPO, that enhances few-step diffusion models with dense rewards, dual-state sampling, and stability techniques for better objective alignment.
Findings
SDPO outperforms existing methods in reward alignment across multiple tasks.
Dense reward strategies improve sample efficiency and policy updates.
Additional refinements enhance stability and long-term dependency handling.
Abstract
Few-step diffusion models enable efficient high-resolution image synthesis but struggle to align with specific downstream objectives due to limitations of existing reinforcement learning (RL) methods in low-step regimes with limited state spaces and suboptimal sample quality. To address this, we propose Stepwise Diffusion Policy Optimization (SDPO), a novel RL framework tailored for few-step diffusion models. SDPO introduces a dual-state trajectory sampling mechanism, tracking both noisy and predicted clean states at each step to provide dense reward feedback and enable low-variance, mixed-step optimization. For further efficiency, we develop a latent similarity-based dense reward prediction strategy to minimize costly dense reward queries. Leveraging these dense rewards, SDPO optimizes a dense reward difference learning objective that enables more frequent and granular policy updates.…
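The abstract does not give SDPO's update rules, so the following is only a minimal sketch of the two ideas it names: recording both the noisy state and the predicted clean state at every sampling step, and estimating rewards for unqueried steps from latent similarity. Everything here is an assumption made for illustration: the deterministic DDIM-style update, the softmax-over-cosine-similarity weighting, the toy denoiser and reward, and all function names (`dual_state_trajectory`, `predict_dense_rewards`). It should not be read as the authors' implementation.

```python
import math
import random


def dual_state_trajectory(x_T, eps_model, alpha_bars):
    """Roll out a few-step sampler, recording (noisy, predicted-clean) pairs.

    Assumption: a deterministic DDIM-style update. alpha_bars[k] is the
    cumulative signal level at step k, ordered from noisiest to cleanest.
    """
    x_t = list(x_T)
    noisy_states, clean_states = [], []
    for k, a_t in enumerate(alpha_bars):
        eps = eps_model(x_t, k)
        # Predicted clean state implied by the current noise estimate.
        x0_hat = [(x - math.sqrt(1.0 - a_t) * e) / math.sqrt(a_t)
                  for x, e in zip(x_t, eps)]
        noisy_states.append(x_t)
        clean_states.append(x0_hat)
        # Move to the next signal level; the last step targets a clean sample.
        a_prev = alpha_bars[k + 1] if k + 1 < len(alpha_bars) else 1.0
        x_t = [math.sqrt(a_prev) * x0 + math.sqrt(1.0 - a_prev) * e
               for x0, e in zip(x0_hat, eps)]
    return noisy_states, clean_states, x_t


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / (norm + 1e-8)


def predict_dense_rewards(clean_states, reward_fn, anchor_idx):
    """Query the reward only at anchor steps; estimate the rest.

    Assumption: non-anchor rewards are a softmax-weighted average of anchor
    rewards, weighted by cosine similarity between predicted clean states.
    """
    anchor_rewards = {i: reward_fn(clean_states[i]) for i in anchor_idx}
    rewards = []
    for i, z in enumerate(clean_states):
        if i in anchor_rewards:
            rewards.append(anchor_rewards[i])
            continue
        weights = [math.exp(cosine(z, clean_states[j])) for j in anchor_idx]
        total = sum(weights)
        rewards.append(sum(w * anchor_rewards[j]
                           for w, j in zip(weights, anchor_idx)) / total)
    return rewards


# Toy usage with a hypothetical denoiser and reward (not from the paper).
random.seed(0)
x_T = [random.gauss(0.0, 1.0) for _ in range(8)]
alpha_bars = [0.05, 0.3, 0.6, 0.9]                    # four sampling steps
toy_eps_model = lambda x, k: [0.1 * v for v in x]     # stand-in noise predictor
toy_reward = lambda x0: -sum(v * v for v in x0) / len(x0)

noisy, clean, final = dual_state_trajectory(x_T, toy_eps_model, alpha_bars)
rewards = predict_dense_rewards(clean, toy_reward, anchor_idx=[0, 3])
# Stepwise reward differences: one plausible per-step learning signal.
reward_diffs = [rewards[i + 1] - rewards[i] for i in range(len(rewards) - 1)]
```

In this sketch only two of the four steps call the reward function. The other two are filled in from the anchors, which is the kind of saving the abstract attributes to dense reward prediction. How SDPO turns such per-step rewards into its dense reward difference objective is not specified in the text above.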
Taxonomy
Methods: Diffusion
