TL;DR
SOAR is a post-training method that improves diffusion models by self-correcting denoising trajectories, leading to better alignment and refinement without requiring reward signals.
Contribution
It introduces a bias-correction post-training method that targets exposure bias along the denoising trajectory, filling the gap between supervised fine-tuning and reinforcement learning.
Findings
SOAR improves GenEval scores from 0.70 to 0.78.
SOAR increases OCR accuracy from 0.64 to 0.67.
It surpasses Flow-GRPO in aesthetic and text-image alignment tasks.
Abstract
The post-training pipeline for diffusion models currently has two stages: supervised fine-tuning (SFT) on curated data and reinforcement learning (RL) with reward models. A fundamental gap separates them. SFT optimizes the denoiser only on ground-truth states sampled from the forward noising process; once inference deviates from these ideal states, subsequent denoising relies on out-of-distribution generalization rather than learned correction, exhibiting the same exposure bias that afflicts autoregressive models, but accumulated along the denoising trajectory instead of the token sequence. RL can in principle address this mismatch, yet its terminal reward signal is sparse, suffers from credit-assignment difficulty, and risks reward hacking. We propose SOAR (Self-Correction for Optimal Alignment and Refinement), a bias-correction post-training method that fills this gap. Starting from a…
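The exposure bias described in the abstract can be illustrated with a toy rollout. The sketch below is not SOAR and not the authors' code. It assumes a rectified-flow style linear interpolation between data and noise (the listing does not state the parameterization), and the names `forward_state`, `toy_velocity`, and `rollout_gap` are hypothetical. It also assumes a deliberately simple denoiser whose prediction carries a constant error and never reacts to the state it is actually in, which is the "no learned correction" case.

```python
import numpy as np


def forward_state(x0, eps, t):
    """Ground-truth noisy state seen during SFT (assumed linear interpolation)."""
    return (1.0 - t) * x0 + t * eps


def toy_velocity(x_t, t, x0, eps, bias):
    """Hypothetical imperfect denoiser.

    It returns the ideal velocity (eps - x0) plus a constant error `bias`,
    and ignores x_t and t, so it cannot correct for drift it caused itself.
    """
    return (eps - x0) + bias


def rollout_gap(x0, eps, bias, num_steps=10):
    """Euler rollout from t=1 to t=0; returns per-step distance between the
    state the sampler actually visits and the ideal forward-process state."""
    ts = np.linspace(1.0, 0.0, num_steps + 1)
    x = eps.copy()  # inference starts from pure noise at t=1
    gaps = []
    for t_cur, t_next in zip(ts[:-1], ts[1:]):
        x = x - (t_cur - t_next) * toy_velocity(x, t_cur, x0, eps, bias)
        gaps.append(float(np.linalg.norm(x - forward_state(x0, eps, t_next))))
    return gaps


x0 = np.zeros(4)
eps = np.ones(4)
gaps = rollout_gap(x0, eps, bias=0.1 * np.ones(4))
for step, gap in enumerate(gaps, start=1):
    print(f"step {step:2d}: off-trajectory distance = {gap:.3f}")
```

In this toy setting the distance from the ideal states grows with every step, because each small error moves the sampler to a state that the forward noising process would not have produced for this sample. With `bias` set to zero the rollout stays on the ideal trajectory.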
