Greedy Growing Enables High-Resolution Pixel-Based Diffusion Models

Cristina N. Vasconcelos; Abdullah Rashwan; Austin Waters; Trevor Walker; Keyang Xu; Jimmy Yan; Rui Qian; Shixin Luo; Zarana Parekh; Andrew Bunner; Hongliang Fei; Roopal Garg; Mandy Guo; Ivana Kajic; Yeqing Li; Henna Nandwani; Jordi Pont-Tuset; Yasumasa Onoe; Sarah Rosston; Su Wang; Wenlei Zhou; Kevin Swersky; David J. Fleet; Jason M. Baldridge; Oliver Wang

arXiv:2405.16759·cs.CV·June 17, 2024

TL;DR

This paper introduces a simple greedy growing method for training high-resolution pixel-based diffusion models, eliminating the need for cascaded super-resolution components and enabling stable, large-scale image generation.

Contribution

The authors propose a novel greedy architecture growth algorithm that stabilizes training and scales diffusion models to high resolutions without cascades or additional regularization.

Findings

01

Able to train models up to 8B parameters without extra regularization

02

Achieved high-resolution 1024x1024 image generation with superior human preference

03

Eliminated the need for cascaded super-resolution in diffusion models

Abstract

We address the long-standing problem of how to learn effective pixel-based image diffusion models at scale, introducing a remarkably simple greedy growing method for stable training of large-scale, high-resolution models without the need for cascaded super-resolution components. The key insight stems from careful pre-training of core components, namely, those responsible for text-to-image alignment vs. high-resolution rendering. We first demonstrate the benefits of scaling a Shallow UNet, with no down(up)-sampling enc(dec)oder. Scaling its deep core layers is shown to improve alignment, object structure, and composition. Building on this core model, we propose a greedy algorithm that grows the architecture into high-resolution end-to-end models, while preserving the integrity of the pre-trained representation, stabilizing training, and reducing the need for large…
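The growing idea described in the abstract can be illustrated with a toy structural sketch. This is not the authors' implementation: the `Block` and `Model` classes, the `grow` helper, the core resolution of 16, the resolution doubling per step, and the choice to freeze pre-trained blocks while new layers train are all assumptions made for illustration. The sketch only tracks which blocks exist and which are trainable; no actual training happens.

```python
from dataclasses import dataclass, field


@dataclass
class Block:
    name: str
    resolution: int           # spatial resolution the block operates at
    pretrained: bool = False  # set once the block has been trained in an earlier phase
    trainable: bool = True


@dataclass
class Model:
    blocks: list = field(default_factory=list)

    @property
    def resolution(self):
        # The input/output resolution is that of the outermost block.
        return self.blocks[0].resolution


def shallow_unet(core_resolution, depth):
    """Core model: deep layers at a single resolution, no down/up-sampling stages."""
    return Model([Block(f"core_{i}", core_resolution) for i in range(depth)])


def mark_pretrained(model):
    """Stand-in for the low-resolution pre-training phase (hypothetical helper)."""
    for block in model.blocks:
        block.pretrained = True
    return model


def grow(model, freeze_pretrained=True):
    """Wrap the model in one new encoder/decoder pair at twice the resolution.

    Freezing the pre-trained blocks is an assumption used here to show one way
    of preserving the pre-trained representation while new layers are trained.
    """
    new_resolution = model.resolution * 2
    for block in model.blocks:
        if block.pretrained and freeze_pretrained:
            block.trainable = False
    encoder = Block(f"enc_{new_resolution}", new_resolution)
    decoder = Block(f"dec_{new_resolution}", new_resolution)
    return Model([encoder, *model.blocks, decoder])


# Usage: pre-train a 4-layer core at an assumed resolution of 16, then grow twice.
core = mark_pretrained(shallow_unet(core_resolution=16, depth=4))
model = grow(grow(core))
print(model.resolution)
print([block.name for block in model.blocks if block.trainable])
```

In this sketch only the newly added encoder and decoder blocks remain trainable after growing, while the core keeps the weights it was given during pre-training.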

Taxonomy

Topics: Medical Imaging Techniques and Applications

Methods: Diffusion