David and Goliath: Small One-step Model Beats Large Diffusion with Score Post-training
Weijian Luo, Colin Zhang, Debing Zhang, Zhengyang Geng

TL;DR
This paper introduces Diff-Instruct*, a data-efficient post-training method that lets small one-step text-to-image models outperform large diffusion models on human preference benchmarks, using online RLHF with a novel score-based divergence regularization.
Contribution
It proposes a novel score-based divergence regularization for RLHF, enabling effective post-training of small models to surpass large diffusion models in quality and efficiency.
Findings
Small 2.6B model outperforms 12B diffusion model in key benchmarks
Score-based regularization improves post-training stability and performance
Small model achieves higher scores with significantly less inference time
Abstract
We propose Diff-Instruct* (DI*), a data-efficient post-training approach for one-step text-to-image generative models that improves their alignment with human preferences without requiring image data. Our method frames alignment as online reinforcement learning from human feedback (RLHF), which optimizes the one-step model to maximize human reward functions while regularizing it to stay close to a reference diffusion process. Unlike traditional RLHF approaches, which rely on the Kullback-Leibler divergence for regularization, we introduce a novel general score-based divergence regularization that substantially improves performance as well as post-training stability. Although the general score-based RLHF objective is intractable to optimize, we derive a strictly equivalent, tractable loss function in theory that can efficiently compute its gradient for optimization. We introduce…
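The contrast the abstract draws between KL-regularized and score-regularized RLHF can be sketched in symbols. This is an illustrative form only, not the paper's exact statement. All notation is assumed here: g_θ is the one-step generator, z its input noise, c the text prompt, r the reward model, α a reward weight, p_θ the generator's output distribution, q the reference diffusion model's distribution, p_{θ,t} and q_t their marginals after diffusing to noise level t, w(t) a weighting function, and d a scalar distance function.

```latex
% Generic regularized RLHF objective (assumed notation):
% maximize reward while a divergence D keeps p_theta near the reference q.
\min_{\theta}\;
  -\alpha\,\mathbb{E}_{z,\,c}\big[\, r\big(g_\theta(z, c),\, c\big) \big]
  \;+\; \mathcal{D}\big(p_\theta,\, q\big)

% Traditional choice: Kullback-Leibler divergence.
\mathcal{D}_{\mathrm{KL}}(p_\theta \,\|\, q)
  = \mathbb{E}_{x \sim p_\theta}\!\left[ \log \frac{p_\theta(x)}{q(x)} \right]

% Score-based choice (sketch): compare score functions along the diffusion.
\mathcal{D}_{\mathrm{score}}(p_\theta,\, q)
  = \int_{0}^{T} w(t)\,
    \mathbb{E}_{x_t \sim p_{\theta,t}}\!\Big[
      d\big( \nabla_{x_t} \log p_{\theta,t}(x_t)
           - \nabla_{x_t} \log q_t(x_t) \big)
    \Big]\, \mathrm{d}t
```

Under these assumptions, choosing d(y) = ||y||² gives a Fisher-type divergence. The score of p_{θ,t} is not available in closed form, which is consistent with the abstract's remark that the objective is intractable to optimize directly and needs an equivalent tractable loss.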
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling
Methods: Contrastive Language-Image Pre-training · ALIGN · Diffusion
