Sparse VideoGen: Accelerating Video Diffusion Transformers with Spatial-Temporal Sparsity

Haocheng Xi; Shuo Yang; Yilong Zhao; Chenfeng Xu; Muyang Li; Xiuyu Li; Yujun Lin; Han Cai; Jintao Zhang; Dacheng Li; Jianfei Chen; Ion Stoica; Kurt Keutzer; Song Han

arXiv:2502.01776 · cs.CV · April 29, 2025

TL;DR

Sparse VideoGen accelerates video diffusion transformers by up to 2.33x by exploiting the inherent spatial-temporal sparsity of their attention, while maintaining video quality.

Contribution

The paper introduces a training-free framework that dynamically identifies sparse attention patterns and optimizes the hardware implementation for efficient video generation.

Findings

01. Achieves up to 2.33x speedup on benchmark models.

02. Maintains high video quality despite acceleration.

03. Provides open-source code for practical use.

Abstract

Diffusion Transformers (DiTs) dominate video generation, but their high computational cost severely limits real-world applicability, usually requiring tens of minutes to generate a few seconds of video even on high-performance GPUs. This inefficiency primarily arises from the quadratic computational complexity of 3D Full Attention with respect to the context length. In this paper, we propose a training-free framework termed Sparse VideoGen (SVG) that leverages the inherent sparsity in 3D Full Attention to boost inference efficiency. We reveal that the attention heads can be dynamically classified into two groups depending on distinct sparse patterns: (1) Spatial Head, where only spatially-related tokens within each frame dominate the attention output, and (2) Temporal Head, where only temporally-related tokens across different frames dominate. Based on this insight, SVG proposes an…
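To make the two head types concrete, here is a minimal NumPy sketch of what a spatial and a temporal sparsity mask could look like. It is not the authors' implementation: it assumes tokens are flattened frame by frame, takes the simplest reading of the abstract (a spatial head keeps only same-frame keys, a temporal head keeps only keys at the same position in other frames), and uses hypothetical helper names (`spatial_mask`, `temporal_mask`, `masked_attention`). How SVG classifies heads and runs the sparse kernels is not covered by this excerpt.

```python
import numpy as np


def spatial_mask(num_frames: int, tokens_per_frame: int) -> np.ndarray:
    """Assumed spatial pattern: query i may attend key j only within the same frame."""
    idx = np.arange(num_frames * tokens_per_frame)
    frame = idx // tokens_per_frame  # frame index of each token (frame-major layout)
    return frame[:, None] == frame[None, :]


def temporal_mask(num_frames: int, tokens_per_frame: int) -> np.ndarray:
    """Assumed temporal pattern: query i may attend key j only at the same
    spatial position, in any frame."""
    idx = np.arange(num_frames * tokens_per_frame)
    pos = idx % tokens_per_frame  # position of each token inside its frame
    return pos[:, None] == pos[None, :]


def masked_attention(q: np.ndarray, k: np.ndarray, v: np.ndarray,
                     mask: np.ndarray) -> np.ndarray:
    """Single-head softmax attention that ignores key positions where mask is False."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = np.where(mask, scores, -np.inf)
    # Subtract the row maximum for numerical stability; every row keeps its
    # diagonal entry under both masks, so the maximum is always finite.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v


if __name__ == "__main__":
    frames, per_frame, dim = 4, 6, 8
    n = frames * per_frame
    rng = np.random.default_rng(0)
    q, k, v = (rng.standard_normal((n, dim)) for _ in range(3))

    for name, mask in [("spatial", spatial_mask(frames, per_frame)),
                       ("temporal", temporal_mask(frames, per_frame))]:
        out = masked_attention(q, k, v, mask)
        # Fraction of the n x n score matrix that the mask keeps.
        print(name, out.shape, f"kept fraction = {mask.mean():.3f}")
```

Under these assumptions the spatial mask keeps a 1/F fraction of the score matrix and the temporal mask a 1/T fraction (F frames, T tokens per frame), which is the kind of saving that offsets the quadratic cost of full 3D attention. This dense-mask version still materializes every score, so it illustrates the pattern only and yields no speedup by itself.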

Taxonomy

Topics: Advanced Optical Imaging Technologies · Generative Adversarial Networks and Image Synthesis · Advanced Vision and Imaging

Methods: Softmax · Attention Is All You Need