DiTFastAttn: Attention Compression for Diffusion Transformer Models
Zhihang Yuan, Hanling Zhang, Pu Lu, Xuefei Ning, Linfeng Zhang, Tianchen Zhao, Shengen Yan, Guohao Dai, Yu Wang

TL;DR
DiTFastAttn is a post-training compression technique that reduces the computational complexity of Diffusion Transformers by exploiting redundancies in attention, leading to significant speedups in image and video generation.
Contribution
The paper introduces DiTFastAttn, a novel method that compresses attention in Diffusion Transformers by identifying and reducing key redundancies during inference.
Findings
Reduces attention FLOPs by up to 76% in image generation.
Achieves up to 1.8x speedup at 2k x 2k resolution.
Effective across multiple models and generation tasks.
Abstract
Diffusion Transformers (DiT) excel at image and video generation but face computational challenges due to the quadratic complexity of self-attention operators. We propose DiTFastAttn, a post-training compression method to alleviate the computational bottleneck of DiT. We identify three key redundancies in the attention computation during DiT inference: (1) spatial redundancy, where many attention heads focus on local information; (2) temporal redundancy, with high similarity between the attention outputs of neighboring steps; (3) conditional redundancy, where conditional and unconditional inferences exhibit significant similarity. We propose three techniques to reduce these redundancies: (1) Window Attention with Residual Sharing to reduce spatial redundancy; (2) Attention Sharing across Timesteps to exploit the similarity between steps; (3) Attention Sharing across CFG to skip…
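The second technique, Attention Sharing across Timesteps, can be illustrated with a minimal sketch. It assumes a single attention head, NumPy arrays for the query, key, and value matrices, and a hand-picked set of steps that reuse the cached output. The names `SharedAttention`, `share_steps`, and `full_calls` are hypothetical and chosen for illustration; the abstract does not describe how the method decides which steps may share, so this is not the authors' implementation.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention(q, k, v):
    # Standard scaled dot-product attention for one head.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v


class SharedAttention:
    """Reuses the previous step's attention output on selected steps.

    `share_steps` is a hypothetical, manually chosen set of step indices;
    a real method would select these from measured output similarity.
    """

    def __init__(self, share_steps):
        self.share_steps = set(share_steps)
        self.cache = None
        self.full_calls = 0  # counts how often attention was actually computed

    def __call__(self, step, q, k, v):
        if step in self.share_steps and self.cache is not None:
            # Skip the quadratic computation and return the cached output.
            return self.cache
        self.cache = attention(q, k, v)
        self.full_calls += 1
        return self.cache


rng = np.random.default_rng(0)
layer = SharedAttention(share_steps={1, 3})
outputs = []
for step in range(4):
    # Random stand-ins for the per-step query, key, and value matrices.
    q, k, v = (rng.standard_normal((16, 8)) for _ in range(3))
    outputs.append(layer(step, q, k, v))

print(layer.full_calls)
```

In this toy loop, steps 1 and 3 return the output cached at steps 0 and 2, so attention is computed on only two of the four steps. The saving comes at the cost of using a slightly stale output, which is acceptable only where neighboring steps are highly similar, as the abstract notes.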
Taxonomy
Topics: Neural Networks and Applications
Methods: Focus
