Layered Diffusion Model for One-Shot High Resolution Text-to-Image Synthesis
Emaad Khwaja, Abdullah Rashwan, Ting Chen, Oliver Wang, Suraj Kothawade, Yeqing Li

TL;DR
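This paper introduces a layered diffusion model that synthesizes high-resolution images from text descriptions in one shot, that is, without separate super-resolution models. Its layered U-Net architecture synthesizes images at multiple resolution scales simultaneously, which improves quality over a single-resolution baseline and lowers the computational cost per step.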
Contribution
A layered U-Net architecture that enables one-shot high-resolution text-to-image synthesis, outperforming the baseline of synthesizing images only at the target resolution while reducing the computational cost per step.
Findings
Outperforms baseline single-resolution models
Reduces computational cost per step
Achieves higher resolution synthesis without extra models
Abstract
We present a one-shot text-to-image diffusion model that can generate high-resolution images from natural language descriptions. Our model employs a layered U-Net architecture that simultaneously synthesizes images at multiple resolution scales. We show that this method outperforms the baseline of synthesizing images only at the target resolution, while reducing the computational cost per step. We demonstrate that higher resolution synthesis can be achieved by layering convolutions at additional resolution scales, in contrast to other methods which require additional models for super-resolution synthesis.
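The idea of synthesizing at several resolution scales at once can be illustrated with a small sketch. This is not the authors' implementation: the 2x2 average-pool downsampling, the unweighted sum of per-scale mean squared errors, and the helper names (avg_pool_2x, build_pyramid, multi_scale_loss) are all assumptions made for illustration. It only shows how one full-resolution target could supervise predictions at every scale of a layered model.

```python
# Hypothetical sketch, NOT the paper's method.
# Assumptions: 2x2 average pooling between scales, equal weight per scale.

def avg_pool_2x(img):
    """Downsample a square 2D grid (list of lists) by 2x2 average pooling."""
    n = len(img) // 2
    return [
        [
            (img[2 * i][2 * j] + img[2 * i][2 * j + 1]
             + img[2 * i + 1][2 * j] + img[2 * i + 1][2 * j + 1]) / 4.0
            for j in range(n)
        ]
        for i in range(n)
    ]


def build_pyramid(img, num_scales):
    """Return [full-res, half-res, quarter-res, ...] with num_scales levels."""
    pyramid = [img]
    for _ in range(num_scales - 1):
        pyramid.append(avg_pool_2x(pyramid[-1]))
    return pyramid


def mse(a, b):
    """Mean squared error between two equally sized 2D grids."""
    total, count = 0.0, 0
    for row_a, row_b in zip(a, b):
        for x, y in zip(row_a, row_b):
            total += (x - y) ** 2
            count += 1
    return total / count


def multi_scale_loss(predictions, target):
    """Sum per-scale MSE between predictions and the downsampled target.

    predictions[k] is the model output at scale k (0 = full resolution).
    """
    targets = build_pyramid(target, len(predictions))
    return sum(mse(p, t) for p, t in zip(predictions, targets))


# Toy 4x4 "image" with values 0..15.
target = [[float(4 * r + c) for c in range(4)] for r in range(4)]
pyramid = build_pyramid(target, 3)
sizes = [len(level) for level in pyramid]

# Predictions that match the target at every scale give zero loss.
perfect_loss = multi_scale_loss(pyramid, target)
```

In this toy setup each coarser level has a quarter of the pixels of the level above it, which hints at why work done at coarse scales is cheap; the actual cost savings reported in the abstract depend on the authors' architecture, not on this sketch.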
Taxonomy
Topics: Computer Graphics and Visualization Techniques
Methods: Max Pooling · Concatenated Skip Connection · Convolution · U-Net · Diffusion
