Efficient-vDiT: Efficient Video Diffusion Transformers With Attention Tile

Hangliang Ding; Dacheng Li; Runlong Su; Peiyuan Zhang; Zhijie Deng; Ion Stoica; Hao Zhang

arXiv:2502.06155·cs.CV·February 18, 2025

TL;DR

Efficient-vDiT introduces a sparse attention mechanism and multi-step distillation to significantly accelerate video diffusion transformer inference, reducing computation time with minimal quality loss.

Contribution

This paper proposes a novel sparse 3D attention method and a multi-step distillation approach to improve the efficiency of video diffusion transformers.

Findings

1. Achieves 7.4x-7.8x faster inference on 29-frame and 93-frame videos.
2. Reduces inference time by leveraging sparse attention and distillation.
3. Maintains comparable video quality with minimal performance trade-offs.

Abstract

Despite the promise of synthesizing high-fidelity videos, Diffusion Transformers (DiTs) with 3D full attention suffer from expensive inference due to the complexity of attention computation and the large number of sampling steps. For example, the popular Open-Sora-Plan model takes more than 9 minutes to generate a single video of 29 frames. This paper addresses the inefficiency from two aspects: 1) Prune the 3D full attention based on the redundancy within video data. We identify a prevalent tile-style repetitive pattern in the 3D attention maps for video data, and advocate a new family of sparse 3D attention that has linear complexity w.r.t. the number of video frames. 2) Shorten the sampling process by adopting existing multi-step consistency distillation. We split the entire sampling trajectory into several segments and perform consistency distillation within each one to…
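To make the linear-complexity claim concrete, here is a minimal sketch that counts attended frame pairs under a frame-level sparse mask. The abstract does not specify the sparsity pattern, so the one used here is an assumption chosen only for illustration: every frame attends to itself and to a fixed set of `num_global` reference frames. The function names and parameters are hypothetical and are not taken from the paper or its repository.

```python
import numpy as np

def frame_level_mask(num_frames: int, num_global: int) -> np.ndarray:
    """Boolean [F, F] mask; entry (i, j) is True if frame i attends to frame j.

    Assumed pattern (illustrative, not the paper's): each frame attends to
    itself and to the first `num_global` frames.
    """
    mask = np.eye(num_frames, dtype=bool)  # each frame attends to itself
    mask[:, :num_global] = True            # plus a fixed set of reference frames
    return mask

def attended_pairs(num_frames: int, num_global: int) -> int:
    """Number of (query frame, key frame) pairs kept by the sparse mask."""
    return int(frame_level_mask(num_frames, num_global).sum())

def full_pairs(num_frames: int) -> int:
    """Number of frame pairs under full 3D attention (quadratic in frames)."""
    return num_frames * num_frames

for frames in (29, 93):
    print(frames, attended_pairs(frames, 4), full_pairs(frames))
```

Under this assumed pattern the kept pair count is F * (G + 1) - G for F frames and G reference frames, so it grows linearly in F, while full attention grows as F squared.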

Code & Models

Repositories

hao-ai-lab/fastvideo (PyTorch)

Taxonomy

Topics: Advanced Memory and Neural Computing · Advanced Optical Imaging Technologies · Neural Networks and Reservoir Computing

Methods: Softmax · Attention Is All You Need · Diffusion