U-DiTs: Downsample Tokens in U-Shaped Diffusion Transformers
Yuchuan Tian, Zhijun Tu, Hanting Chen, Jie Hu, Chao Xu, Yunhe Wang

TL;DR
U-DiTs introduce token downsampling into diffusion transformers, using a U-shaped architecture to reduce computation while maintaining or improving image generation quality.
Contribution
The paper proposes U-DiT models that incorporate token downsampling in U-shaped diffusion transformers, significantly reducing computation and enhancing performance over existing DiT models.
Findings
U-DiT outperforms DiT-XL/2 at only 1/6 of its computation cost.
Token downsampling improves efficiency without sacrificing quality.
U-DiT demonstrates superior performance in latent-space image generation.
Abstract
Diffusion Transformers (DiTs) introduce the transformer architecture to diffusion tasks for latent-space image generation. With an isotropic architecture that chains a series of transformer blocks, DiTs demonstrate competitive performance and good scalability; meanwhile, the abandonment of U-Net by DiTs and their subsequent improvements is worth rethinking. To this end, we conduct a simple toy experiment comparing a U-Net-architectured DiT with an isotropic one. It turns out that the U-Net architecture gains only a slight advantage amid the U-Net inductive bias, indicating potential redundancies within the U-Net-style DiT. Inspired by the discovery that U-Net backbone features are low-frequency-dominated, we perform token downsampling on the query-key-value tuple for self-attention, which brings further improvements despite a considerable reduction in computation. Based on…
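The abstract's idea of downsampling the query-key-value tuple before self-attention can be illustrated with a minimal sketch. Everything in it is an assumption for illustration rather than the authors' implementation: the 2x2 strided split into four smaller token grids, the identity Q/K/V projections, and the helper names (`split_tokens`, `merge_tokens`, `self_attention`, `downsampled_attention`, `attention_pairs`) are all hypothetical.

```python
import math

# Hypothetical helpers; the 2x2 strided split is an assumed downsampler.
OFFSETS = [(0, 0), (0, 1), (1, 0), (1, 1)]

def split_tokens(grid):
    """Split an H x W token grid into four (H/2) x (W/2) grids by 2x2 strided sampling."""
    h, w = len(grid), len(grid[0])
    assert h % 2 == 0 and w % 2 == 0, "grid sides must be even"
    return [[[grid[i][j] for j in range(dj, w, 2)] for i in range(di, h, 2)]
            for di, dj in OFFSETS]

def merge_tokens(groups):
    """Inverse of split_tokens: interleave the four small grids back into one."""
    hh, hw = len(groups[0]), len(groups[0][0])
    grid = [[None] * (2 * hw) for _ in range(2 * hh)]
    for group, (di, dj) in zip(groups, OFFSETS):
        for i in range(hh):
            for j in range(hw):
                grid[2 * i + di][2 * j + dj] = group[i][j]
    return grid

def self_attention(tokens):
    """Single-head softmax attention with identity Q/K/V projections (a simplification)."""
    d = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        top = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append([sum(wt * v[c] for wt, v in zip(weights, tokens)) / total
                    for c in range(d)])
    return out

def downsampled_attention(grid):
    """Run attention separately inside each downsampled grid, then merge back."""
    attended = []
    for group in split_tokens(grid):
        width = len(group[0])
        flat = [tok for row in group for tok in row]
        out = self_attention(flat)
        attended.append([out[r * width:(r + 1) * width] for r in range(len(group))])
    return merge_tokens(attended)

def attention_pairs(num_tokens):
    """Number of query-key pairs scored by full self-attention over num_tokens."""
    return num_tokens * num_tokens

# A 4 x 4 grid of 2-dimensional tokens.
grid = [[[float(i), float(j)] for j in range(4)] for i in range(4)]
result = downsampled_attention(grid)
full_cost = attention_pairs(16)          # all 16 tokens attend to each other
reduced_cost = 4 * attention_pairs(4)    # four groups of 4 tokens each
```

Under these assumptions, the number of query-key pairs falls from 256 to 64 for a 4 x 4 grid, a fourfold reduction in the attention score computation alone. This count covers only the attention scores and is not meant to reproduce the whole-model cost ratio reported for U-DiT.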
Taxonomy
Topics: Semiconductor materials and devices
Methods: Concatenated Skip Connection · Convolution · Max Pooling · U-Net · Diffusion
