Sparse VideoGen: Accelerating Video Diffusion Transformers with Spatial-Temporal Sparsity
Haocheng Xi, Shuo Yang, Yilong Zhao, Chenfeng Xu, Muyang Li, Xiuyu Li, Yujun Lin, Han Cai, Jintao Zhang, Dacheng Li, Jianfei Chen, Ion Stoica, Kurt Keutzer, Song Han

TL;DR
Sparse VideoGen significantly accelerates video diffusion transformers by exploiting inherent spatial-temporal sparsity in attention mechanisms, achieving over twofold speedup without sacrificing quality.
Contribution
The paper introduces a training-free framework that dynamically identifies the sparse attention pattern of each head and pairs it with a hardware-efficient implementation for video generation.
Findings
Achieves up to 2.33x speedup on benchmark models.
Maintains high video quality despite acceleration.
Provides open-source code for practical use.
Abstract
Diffusion Transformers (DiTs) dominate video generation but their high computational cost severely limits real-world applicability, usually requiring tens of minutes to generate a few seconds of video even on high-performance GPUs. This inefficiency primarily arises from the quadratic computational complexity of 3D Full Attention with respect to the context length. In this paper, we propose a training-free framework termed Sparse VideoGen (SVG) that leverages the inherent sparsity in 3D Full Attention to boost inference efficiency. We reveal that the attention heads can be dynamically classified into two groups depending on distinct sparse patterns: (1) Spatial Head, where only spatially-related tokens within each frame dominate the attention output, and (2) Temporal Head, where only temporally-related tokens across different frames dominate. Based on this insight, SVG proposes an…
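To make the two head types concrete, here is a minimal sketch of the attention masks they suggest. It is an illustration under stated assumptions, not SVG's implementation: it assumes tokens are flattened frame by frame (token index = frame * tokens_per_frame + position), and the names spatial_mask, temporal_mask, and density are hypothetical. The real patterns described in the paper may be looser than these strict masks.

```python
# Toy masks for the two sparse patterns described in the abstract.
# Assumption: tokens are flattened frame-major, so
#   token index = frame * tokens_per_frame + position_in_frame.
# All function names here are hypothetical, chosen for illustration.

def spatial_mask(num_frames, tokens_per_frame):
    """Query i may attend to key j only if both lie in the same frame."""
    n = num_frames * tokens_per_frame
    return [[i // tokens_per_frame == j // tokens_per_frame for j in range(n)]
            for i in range(n)]


def temporal_mask(num_frames, tokens_per_frame):
    """Query i may attend to key j only if both share a spatial position."""
    n = num_frames * tokens_per_frame
    return [[i % tokens_per_frame == j % tokens_per_frame for j in range(n)]
            for i in range(n)]


def density(mask):
    """Fraction of query-key pairs kept, relative to full attention."""
    total = sum(len(row) for row in mask)
    kept = sum(sum(row) for row in mask)
    return kept / total


if __name__ == "__main__":
    frames, per_frame = 4, 6  # tiny sizes so the masks are easy to inspect
    print("spatial density:", density(spatial_mask(frames, per_frame)))
    print("temporal density:", density(temporal_mask(frames, per_frame)))
```

Under these assumptions a spatial mask keeps 1/num_frames of the query-key pairs and a temporal mask keeps 1/tokens_per_frame, while full attention scores every pair. That gap is why the quadratic cost in context length leaves room for savings when a head's output is dominated by one of the two patterns.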
Taxonomy
Topics: Advanced Optical Imaging Technologies · Generative Adversarial Networks and Image Synthesis · Advanced Vision and Imaging
Methods: Softmax · Attention Is All You Need
