PiT: Progressive Diffusion Transformer
Jiafu Wu, Yabiao Wang, Jian Li, Jinlong Peng, Yun Cao, Chengjie Wang, Jiangning Zhang

TL;DR
This paper introduces PiT, a new diffusion transformer architecture that reduces computational costs by addressing global attention redundancy and enhancing local-global information exchange, leading to improved image generation performance.
Contribution
The paper proposes Pseudo Shifted Window Attention (PSWA) to mitigate global attention redundancy and Progressive Coverage Channel Allocation (PCCA) to capture high-order attention efficiently, advancing diffusion transformer design.
Findings
PiT-L achieves 54% FID improvement over DiT-XL/2.
Proposed methods reduce computational cost while maintaining high performance.
Extensive experiments validate the effectiveness of the PiT architecture.
Abstract
Diffusion Transformers (DiTs) achieve remarkable performance in image generation via the transformer architecture. Conventionally, DiTs are constructed by stacking serial isotropic global modeling transformers, which face significant quadratic computational cost. However, through empirical analysis, we find that DiTs do not rely as heavily on global information as previously believed. In fact, most layers exhibit significant redundancy in global computation. Additionally, conventional attention mechanisms suffer from low-frequency inertia, limiting their efficiency. To address these issues, we propose Pseudo Shifted Window Attention (PSWA), which fundamentally mitigates global attention redundancy. PSWA achieves a moderate degree of global-local information exchange through window attention. It further utilizes a high-frequency bridging branch to simulate shifted window operations, which both enrich the…
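To make the quadratic-cost point concrete, the sketch below counts query-key pairs for global attention and for plain non-overlapping window attention on a token grid. It is not the paper's PSWA (there is no bridging branch and no channel allocation), and the function names, the 16x16 grid, and the window size of 4 are illustrative assumptions rather than values taken from the paper.

```python
def window_partition(height, width, win):
    """Group the token indices of a height x width grid into
    non-overlapping win x win windows (row-major token order)."""
    if height % win or width % win:
        raise ValueError("grid size must be divisible by the window size")
    windows = []
    for top in range(0, height, win):
        for left in range(0, width, win):
            windows.append([
                row * width + col
                for row in range(top, top + win)
                for col in range(left, left + win)
            ])
    return windows


def global_pairs(height, width):
    """Query-key pairs when every token attends to every token."""
    n = height * width
    return n * n


def windowed_pairs(height, width, win):
    """Query-key pairs when tokens attend only within their own window."""
    return sum(len(w) ** 2 for w in window_partition(height, width, win))


if __name__ == "__main__":
    # Hypothetical setting: 16 x 16 token grid, 4 x 4 windows.
    h = w = 16
    win = 4
    print("global pairs:  ", global_pairs(h, w))
    print("windowed pairs:", windowed_pairs(h, w, win))
```

Under these assumptions, the number of pairs scores with the square of the token count for global attention but only linearly in the token count (times the window area) for window attention. The price is that tokens in different windows no longer exchange information directly, which is the gap that shifted-window schemes are meant to close.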
Taxonomy
Methods: Softmax · Attention Is All You Need
