Training-free and Adaptive Sparse Attention for Efficient Long Video Generation
Yifei Xia, Suhan Ling, Fangcheng Fu, Yujie Wang, Huixia Li, Xuefeng Xiao, Bin Cui

TL;DR
This paper introduces AdaSpa, a novel adaptive sparse attention method for Diffusion Transformers that significantly accelerates long video generation without sacrificing quality by leveraging hierarchical sparsity and dynamic pattern search.
Contribution
The paper presents AdaSpa, the first sparse attention method to combine a dynamic pattern with online precise search. It is plug-and-play, dataset-independent, and improves the efficiency of long video generation.
Findings
AdaSpa achieves substantial acceleration in video generation.
It maintains high video quality with reduced computational cost.
The method seamlessly integrates with existing Diffusion Transformers.
Abstract
Generating high-fidelity long videos with Diffusion Transformers (DiTs) is often hindered by significant latency, primarily due to the computational demands of attention mechanisms. For instance, generating an 8-second 720p video (110K tokens) with HunyuanVideo takes about 600 PFLOPs, with around 500 PFLOPs consumed by attention computations. To address this issue, we propose AdaSpa, the first Dynamic Pattern and Online Precise Search sparse attention method. Firstly, to realize the Dynamic Pattern, we introduce a blockified pattern to efficiently capture the hierarchical sparsity inherent in DiTs. This is based on our observation that sparse characteristics of DiTs exhibit hierarchical and blockified structures between and within different modalities. This blockified approach significantly reduces the complexity of attention computation while maintaining high fidelity in the generated…
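To make the idea of a blockified sparse pattern concrete, here is a minimal NumPy sketch of block-sparse attention. It is an illustration under stated assumptions, not AdaSpa's actual algorithm: the function name `block_sparse_attention`, the parameters `block_size` and `keep_blocks`, and the mean-pooled block scoring are all hypothetical choices made for this example. The abstract does not specify how the search step selects blocks.

```python
import numpy as np

def block_sparse_attention(q, k, v, block_size, keep_blocks):
    """Each query block attends only to its top-scoring key blocks.

    Hypothetical sketch: block selection by mean-pooled Q/K scores is an
    assumption for illustration, not the method described in the paper.
    """
    n, d = q.shape
    assert n % block_size == 0, "sketch assumes n is a multiple of block_size"
    nb = n // block_size

    # Cheap block-level scores from mean-pooled queries and keys.
    qb = q.reshape(nb, block_size, d).mean(axis=1)
    kb = k.reshape(nb, block_size, d).mean(axis=1)
    block_scores = qb @ kb.T  # shape (nb, nb)

    # For each query block, keep the keep_blocks highest-scoring key blocks.
    top = np.argsort(-block_scores, axis=1)[:, :keep_blocks]
    block_mask = np.zeros((nb, nb), dtype=bool)
    np.put_along_axis(block_mask, top, True, axis=1)

    # Expand the block mask to token resolution.
    mask = block_mask.repeat(block_size, axis=0).repeat(block_size, axis=1)

    # For clarity this masks a dense score matrix; an efficient kernel
    # would skip the masked blocks instead of computing them.
    scores = (q @ k.T) / np.sqrt(d)
    scores = np.where(mask, scores, -np.inf)
    scores = scores - scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights @ v, block_mask

rng = np.random.default_rng(0)
q = rng.standard_normal((8, 4))
k = rng.standard_normal((8, 4))
v = rng.standard_normal((8, 4))
out, block_mask = block_sparse_attention(q, k, v, block_size=2, keep_blocks=1)
```

In this sketch the fraction of key blocks each query block attends to is `keep_blocks / nb`, so `keep_blocks=1` with four blocks keeps a quarter of the attention matrix. Setting `keep_blocks` equal to the number of blocks recovers ordinary dense attention.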
