ProxT2I: Efficient Reward-Guided Text-to-Image Generation via Proximal Diffusion
Zhenghan Fang, Jian Zheng, Qiaozi Gao, Xiaofeng Gao, Jeremias Sulam

TL;DR
ProxT2I is a text-to-image diffusion model that replaces forward discretization and learned score functions with backward discretization and learned proximal operators, producing high-quality images with less computation and better alignment with human preferences.
Contribution
The paper proposes a text-to-image diffusion model built on backward discretization with learned, conditional proximal operators, improving efficiency and stability over traditional score-based methods.
Findings
Improves sampling efficiency over score-based samplers and strengthens human-preference alignment.
Achieves competitive results with less compute and smaller models.
Introduces the LAION-Face-T2I-15M dataset for training and evaluation.
Abstract
Diffusion models have emerged as a dominant paradigm for generative modeling across a wide range of domains, including prompt-conditional generation. The vast majority of samplers, however, rely on forward discretization of the reverse diffusion process and use score functions that are learned from data. Such forward and explicit discretizations can be slow and unstable, requiring a large number of sampling steps to produce good-quality samples. In this work we develop a text-to-image (T2I) diffusion model based on backward discretizations, dubbed ProxT2I, relying on learned and conditional proximal operators instead of score functions. We further leverage recent advances in reinforcement learning and policy optimization to optimize our samplers for task-specific rewards. Additionally, we develop a new large-scale and open-source dataset comprising 15 million high-quality human images…
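The contrast the abstract draws between forward (explicit) and backward (implicit) discretization can be illustrated on a toy problem. The sketch below is not the ProxT2I sampler and uses no learned components. It assumes a one-dimensional quadratic f(x) = 0.5 * a * x^2 and its gradient flow x' = -a * x, where the backward Euler step has a closed form that coincides with the proximal operator of tau * f. The names `a`, `tau`, `forward_euler_step`, `backward_euler_step`, and `run` are hypothetical and chosen only for this illustration.

```python
# Toy illustration only: gradient flow x' = -a * x for f(x) = 0.5 * a * x**2.
# This is NOT the paper's method; it only contrasts explicit and implicit steps.


def forward_euler_step(x, a, tau):
    """Explicit step: x_next = x - tau * grad_f(x), with grad_f(x) = a * x."""
    return x - tau * a * x


def backward_euler_step(x, a, tau):
    """Implicit step: solve x_next = x - tau * a * x_next for x_next.

    For this quadratic f the solution is x / (1 + tau * a), which is also
    prox_{tau * f}(x) = argmin_z f(z) + (z - x)**2 / (2 * tau).
    """
    return x / (1.0 + tau * a)


def run(step, x0, a, tau, n_steps):
    """Apply the given one-step update n_steps times starting from x0."""
    x = x0
    for _ in range(n_steps):
        x = step(x, a, tau)
    return x


# Assumed values: tau * a = 5, a deliberately large step size.
a, tau, x0, n_steps = 10.0, 0.5, 1.0, 5

forward_result = run(forward_euler_step, x0, a, tau, n_steps)
backward_result = run(backward_euler_step, x0, a, tau, n_steps)

print("forward :", forward_result)   # magnitude grows: each step multiplies by -4
print("backward:", backward_result)  # decays toward 0: each step divides by 6
```

With this step size the explicit update multiplies the iterate by (1 - tau * a) = -4 at every step and diverges, while the implicit update divides it by (1 + tau * a) = 6 and contracts toward the minimizer for any positive tau. In the setting the abstract describes, no such closed form is available, which is why the proximal operators are learned.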
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning
