Scaling Rectified Flow Transformers for High-Resolution Image Synthesis
Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, Dustin Podell, Tim Dockhorn, Zion English, Kyle Lacey, Alex Goodwin, Yannik Marek, Robin Rombach

TL;DR
This paper advances high-resolution image synthesis by improving rectified flow models with perceptually biased noise sampling and introducing a novel transformer architecture that enhances text-to-image generation, outperforming existing methods.
Contribution
It presents a new noise sampling technique for rectified flow models and a transformer-based architecture with separate modality weights for improved text-to-image synthesis.
Findings
Superior performance over established diffusion formulations in high-resolution text-to-image synthesis
Model scaling correlates with better synthesis quality
Outperforms state-of-the-art models in human evaluations
Abstract
Diffusion models create data from noise by inverting the forward paths of data towards noise and have emerged as a powerful generative modeling technique for high-dimensional, perceptual data such as images and videos. Rectified flow is a recent generative model formulation that connects data and noise in a straight line. Despite its better theoretical properties and conceptual simplicity, it is not yet decisively established as standard practice. In this work, we improve existing noise sampling techniques for training rectified flow models by biasing them towards perceptually relevant scales. Through a large-scale study, we demonstrate the superior performance of this approach compared to established diffusion formulations for high-resolution text-to-image synthesis. Additionally, we present a novel transformer-based architecture for text-to-image generation that uses separate weights…
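The straight-line connection between data and noise mentioned in the abstract can be illustrated with a minimal sketch. It assumes the common rectified flow convention z_t = (1 - t) * x0 + t * noise, with the constant velocity noise - x0 as the regression target. The excerpt above does not say which distribution is used to bias timesteps, so the logit-normal sampler here is an assumed example of putting more weight on intermediate noise levels. The function names, toy vectors, and default parameters are hypothetical and are not taken from the authors' code.

```python
import math
import random


def sample_timestep_logit_normal(rng, mean=0.0, std=1.0):
    """Draw t in (0, 1) by passing a Gaussian sample through a sigmoid.

    Assumption: this is one possible way to favor intermediate timesteps
    over the endpoints; mean and std are illustrative defaults.
    """
    u = rng.gauss(mean, std)
    return 1.0 / (1.0 + math.exp(-u))


def rectified_flow_pair(x0, noise, t):
    """Return the point on the straight line from data x0 (t=0) to noise (t=1),
    together with the constant velocity target (noise - x0)."""
    z_t = [(1.0 - t) * a + t * b for a, b in zip(x0, noise)]
    velocity = [b - a for a, b in zip(x0, noise)]
    return z_t, velocity


# Toy usage: a 3-dimensional "data point" stands in for an image latent.
rng = random.Random(0)
x0 = [0.5, -1.0, 2.0]
noise = [rng.gauss(0.0, 1.0) for _ in x0]
t = sample_timestep_logit_normal(rng)
z_t, velocity = rectified_flow_pair(x0, noise, t)
```

In a training loop, a network would receive z_t and t and be trained to predict velocity. Only the distribution that t is drawn from changes when the sampling is biased.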
Code & Models
- 🤗 stabilityai/stable-diffusion-3.5-large (model; 76k downloads, 3386 likes)
- 🤗 stabilityai/stable-diffusion-3-medium (model; 4.0k downloads, 4922 likes)
- 🤗 stabilityai/stable-diffusion-3.5-medium (model; 100k downloads, 917 likes)
- 🤗 stabilityai/stable-diffusion-3-medium-diffusers (model; 78k downloads, 442 likes)
- 🤗 silvertuanzi/sd3_medium_backup (model)
- 🤗 adamo1139/stable-diffusion-3-medium-ungated (model; 31 likes)
- 🤗 ckpt/stable-diffusion-3-medium (model; 11 likes)
- 🤗 lodestones/stable-diffusion-3-medium (model; 9 likes)
- 🤗 v2ray/stable-diffusion-3-medium-diffusers (model; 818 downloads, 8 likes)
- 🤗 leo009/stable-diffusion-3-medium (model; 94 downloads, 8 likes)
Taxonomy
Topics: Computer Graphics and Visualization Techniques · Generative Adversarial Networks and Image Synthesis · Advanced Vision and Imaging
Methods: Diffusion
