Scaling Rectified Flow Transformers for High-Resolution Image Synthesis

Patrick Esser; Sumith Kulal; Andreas Blattmann; Rahim Entezari; Jonas Müller; Harry Saini; Yam Levi; Dominik Lorenz; Axel Sauer; Frederic Boesel; Dustin Podell; Tim Dockhorn; Zion English; Kyle Lacey; Alex Goodwin; Yannik Marek; Robin Rombach

arXiv:2403.03206 · cs.CV · March 6, 2024 · 86 cites

TL;DR

This paper improves rectified flow training for high-resolution image synthesis by biasing noise sampling towards perceptually relevant scales, and introduces a transformer architecture with separate weights for the text and image modalities. The resulting models outperform established diffusion formulations and, in human evaluations, state-of-the-art models.

Contribution

The paper presents a new noise sampling technique for training rectified flow models and a transformer-based architecture that uses separate weights for each modality to improve text-to-image synthesis.

Findings

01 Superior performance over diffusion models in high-res text-to-image tasks

02 Model scaling correlates with better synthesis quality

03 Outperforms state-of-the-art models in human evaluations

Abstract

Diffusion models create data from noise by inverting the forward paths of data towards noise and have emerged as a powerful generative modeling technique for high-dimensional, perceptual data such as images and videos. Rectified flow is a recent generative model formulation that connects data and noise in a straight line. Despite its better theoretical properties and conceptual simplicity, it is not yet decisively established as standard practice. In this work, we improve existing noise sampling techniques for training rectified flow models by biasing them towards perceptually relevant scales. Through a large-scale study, we demonstrate the superior performance of this approach compared to established diffusion formulations for high-resolution text-to-image synthesis. Additionally, we present a novel transformer-based architecture for text-to-image generation that uses separate weights…
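
The straight-line connection between data and noise described in the abstract can be illustrated with a small sketch. This is not the authors' code: the function names are hypothetical, and the logit-normal timestep sampler is an assumed example of one way to bias training towards intermediate noise scales, since the abstract does not specify the sampler here. The sketch uses only the Python standard library and treats a sample as a flat list of floats.

```python
import math
import random


def sample_timestep(rng, mean=0.0, std=1.0):
    """Draw t in (0, 1) as sigmoid(u) with u ~ N(mean, std^2).

    Assumption: a logit-normal sampler, which puts more mass on
    intermediate timesteps than uniform sampling does.
    """
    u = rng.gauss(mean, std)
    return 1.0 / (1.0 + math.exp(-u))


def interpolate(x0, noise, t):
    """Point on the straight line from data x0 (t=0) to noise (t=1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, noise)]


def velocity_target(x0, noise):
    """Derivative of the straight path with respect to t: noise - x0."""
    return [b - a for a, b in zip(x0, noise)]


def rectified_flow_loss(predict_velocity, x0, rng):
    """Mean squared error between predicted and true path velocity.

    predict_velocity is a hypothetical model callable taking (z_t, t).
    """
    noise = [rng.gauss(0.0, 1.0) for _ in x0]
    t = sample_timestep(rng)
    z_t = interpolate(x0, noise, t)
    target = velocity_target(x0, noise)
    pred = predict_velocity(z_t, t)
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(x0)
```

A call such as `rectified_flow_loss(lambda z, t: [0.0] * len(z), [0.5, -0.5], random.Random(0))` evaluates the loss for a placeholder model that always predicts zero velocity. In practice the callable would be a trained network and the samples would be image latents rather than short lists.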

Taxonomy

Topics: Computer Graphics and Visualization Techniques · Generative Adversarial Networks and Image Synthesis · Advanced Vision and Imaging

Methods: Diffusion