Efficient-vDiT: Efficient Video Diffusion Transformers With Attention Tile
Hangliang Ding, Dacheng Li, Runlong Su, Peiyuan Zhang, Zhijie Deng, Ion Stoica, Hao Zhang

TL;DR
Efficient-vDiT introduces a sparse attention mechanism and multi-step distillation to significantly accelerate video diffusion transformer inference, reducing computation time with minimal quality loss.
Contribution
This paper proposes a novel sparse 3D attention method and a multi-step distillation approach to improve the efficiency of video diffusion transformers.
Findings
Achieves 7.4x to 7.8x faster inference on 29-frame and 93-frame videos.
Reduces inference time by leveraging sparse attention and distillation.
Maintains comparable video quality with minimal performance trade-offs.
Abstract
Despite the promise of synthesizing high-fidelity videos, Diffusion Transformers (DiTs) with 3D full attention suffer from expensive inference due to the complexity of attention computation and numerous sampling steps. For example, the popular Open-Sora-Plan model consumes more than 9 minutes for generating a single video of 29 frames. This paper addresses the inefficiency issue from two aspects: 1) Prune the 3D full attention based on the redundancy within video data; We identify a prevalent tile-style repetitive pattern in the 3D attention maps for video data, and advocate a new family of sparse 3D attention that holds a linear complexity w.r.t. the number of video frames. 2) Shorten the sampling process by adopting existing multi-step consistency distillation; We split the entire sampling trajectory into several segments and perform consistency distillation within each one to…
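The two ideas in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the function names, the choice of "each frame attends to itself plus the first few reference frames", and the even split of timesteps are all assumptions made for illustration. The point is only to show why a tile-style mask with a fixed number of attended frames grows linearly with the number of frames, and what splitting a sampling trajectory into segments looks like.

```python
import numpy as np


def tile_sparse_mask(num_frames, tokens_per_frame, num_ref_frames):
    """Build a boolean attention mask of shape (N, N), N = num_frames * tokens_per_frame.

    Assumed (hypothetical) pattern: every frame attends to itself and to the
    first `num_ref_frames` frames. Each allowed frame pair becomes one dense
    tile of size tokens_per_frame x tokens_per_frame.
    """
    frame_mask = np.eye(num_frames, dtype=bool)   # each frame sees itself
    frame_mask[:, :num_ref_frames] = True         # plus a fixed set of reference frames
    # Expand every frame-level entry into a token-level tile.
    mask = np.repeat(frame_mask, tokens_per_frame, axis=0)
    mask = np.repeat(mask, tokens_per_frame, axis=1)
    return mask


def split_trajectory(num_steps, num_segments):
    """Split timestep indices [0, num_steps) into contiguous segments.

    Assumption: segments are as equal in length as possible. Distillation
    would then be run inside each segment; that part is not shown here.
    """
    steps = np.arange(num_steps)
    return [seg.tolist() for seg in np.array_split(steps, num_segments)]


if __name__ == "__main__":
    for frames in (8, 16, 32):
        sparse = int(tile_sparse_mask(frames, 1, 2).sum())
        full = frames * frames
        print(f"frames={frames}: sparse frame pairs={sparse}, full frame pairs={full}")
    print([len(seg) for seg in split_trajectory(100, 4)])
```

With a fixed number of reference frames, the count of attended frame pairs grows roughly in proportion to the number of frames, while full 3D attention grows with its square.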
Taxonomy
Topics: Advanced Memory and Neural Computing · Advanced Optical Imaging Technologies · Neural Networks and Reservoir Computing
Methods: Softmax · Attention Is All You Need · Diffusion
