DiTFastAttn: Attention Compression for Diffusion Transformer Models

Zhihang Yuan; Hanling Zhang; Pu Lu; Xuefei Ning; Linfeng Zhang; Tianchen Zhao; Shengen Yan; Guohao Dai; Yu Wang

arXiv:2406.08552 · cs.CV · October 21, 2024

TL;DR

DiTFastAttn is a post-training compression technique that reduces the computational complexity of Diffusion Transformers by exploiting redundancies in attention, leading to significant speedups in image and video generation.

Contribution

The paper introduces DiTFastAttn, a novel method that compresses attention in Diffusion Transformers by identifying and reducing key redundancies during inference.

Findings

01. Reduces up to 76% of attention FLOPs in image generation.

02. Achieves up to 1.8x speedup at 2k x 2k resolution.

03. Effective across multiple models and generation tasks.

Abstract

Diffusion Transformers (DiT) excel at image and video generation but face computational challenges due to the quadratic complexity of self-attention operators. We propose DiTFastAttn, a post-training compression method to alleviate the computational bottleneck of DiT. We identify three key redundancies in the attention computation during DiT inference: (1) spatial redundancy, where many attention heads focus on local information; (2) temporal redundancy, with high similarity between the attention outputs of neighboring steps; (3) conditional redundancy, where conditional and unconditional inferences exhibit significant similarity. We propose three techniques to reduce these redundancies: (1) Window Attention with Residual Sharing to reduce spatial redundancy; (2) Attention Sharing across Timesteps to exploit the similarity between steps; (3) Attention Sharing across CFG to skip…
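The first two techniques named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes one plausible reading of the abstract: "residual sharing" caches the difference between full and windowed attention at one step and adds it back at later steps, and "sharing across timesteps" reuses a cached attention output. The names `attention`, `CachedAttention`, `full_step`, `window_step`, and `shared_step` are hypothetical, and the example uses a single head over a 1-D token sequence in pure Python for clarity.

```python
import math
import operator


def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention(q, k, v, window=None):
    """Single-head attention; q, k, v are lists of token vectors.

    window=None attends to every token (quadratic cost). Otherwise token i
    attends only to tokens j with |i - j| <= window (local attention).
    """
    n, d = len(q), len(q[0])
    out = []
    for i in range(n):
        keys = [j for j in range(n) if window is None or abs(i - j) <= window]
        scores = [sum(a * b for a, b in zip(q[i], k[j])) / math.sqrt(d) for j in keys]
        weights = softmax(scores)
        out.append([sum(w * v[j][c] for w, j in zip(weights, keys))
                    for c in range(len(v[0]))])
    return out


def elementwise(op, a, b):
    return [[op(x, y) for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


class CachedAttention:
    """Hypothetical wrapper that caches results between denoising steps."""

    def __init__(self, window):
        self.window = window
        self.residual = None     # full output minus windowed output
        self.last_output = None  # most recent attention output

    def full_step(self, q, k, v):
        # Pay for full attention once and remember what the window misses.
        full = attention(q, k, v)
        local = attention(q, k, v, self.window)
        self.residual = elementwise(operator.sub, full, local)
        self.last_output = full
        return full

    def window_step(self, q, k, v):
        # Assumed residual sharing: cheap windowed attention plus cached residual.
        if self.residual is None:
            raise ValueError("call full_step before window_step")
        local = attention(q, k, v, self.window)
        self.last_output = elementwise(operator.add, local, self.residual)
        return self.last_output

    def shared_step(self):
        # Assumed timestep sharing: skip attention and reuse the cached output.
        return self.last_output
```

With identical inputs, `window_step` reproduces the `full_step` output up to floating-point error. In an actual sampler the inputs change from step to step, so the cached residual and cached output are approximations whose quality depends on how similar neighboring steps are.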

Videos

DiTFastAttn: Attention Compression for Diffusion Transformer Models (SlidesLive)

Taxonomy

Topics: Neural Networks and Applications

Methods: Focus