VSA: Faster Video Diffusion with Trainable Sparse Attention
Peiyuan Zhang, Yongqi Chen, Haofeng Huang, Will Lin, Zhengzhong Liu, Ion Stoica, Eric Xing, Hao Zhang

TL;DR
VSA is a trainable sparse attention mechanism for video diffusion transformers. It cuts training FLOPS by 2.53× with no loss in diffusion quality and speeds up attention computation by 6×, enabling faster training and inference for large-scale models.
Contribution
The paper presents VSA, a trainable sparse attention method that replaces full attention in video diffusion transformers at both training and inference, improving efficiency without sacrificing generation quality.
Findings
Reduces training FLOPS by 2.53× with no loss in diffusion quality.
Speeds up attention computation by 6× on open-source models.
Cuts end-to-end generation time from 31s to 18s with comparable quality.
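As a rough sanity check on how the figures above relate, the sketch below computes the end-to-end speedup and, using Amdahl's law, the share of the original runtime that attention would have to occupy for a 6× attention speedup to yield that result. It assumes the 6× figure applies to all attention time and that nothing else in the pipeline changes; the source does not state either, and the function names are illustrative.

```python
def end_to_end_speedup(before_s: float, after_s: float) -> float:
    """Ratio of old to new wall-clock generation time."""
    return before_s / after_s


def implied_attention_fraction(before_s: float, after_s: float,
                               attention_speedup: float) -> float:
    """Solve Amdahl's law, after = before * ((1 - f) + f / s), for f.

    f is the fraction of the original runtime spent in attention,
    under the assumption that only attention got faster.
    """
    return (1.0 - after_s / before_s) / (1.0 - 1.0 / attention_speedup)


speedup = end_to_end_speedup(31.0, 18.0)
fraction = implied_attention_fraction(31.0, 18.0, 6.0)
print(f"end-to-end speedup: {speedup:.2f}x")
print(f"implied attention share of original runtime: {fraction:.2f}")
```

Under these assumptions, the reported numbers are consistent with attention accounting for roughly half of the original generation time.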
Abstract
Scaling video diffusion transformers (DiTs) is limited by their quadratic 3D attention, even though most of the attention mass concentrates on a small subset of positions. We turn this observation into VSA, a trainable, hardware-efficient sparse attention that replaces full attention at both training and inference. In VSA, a lightweight coarse stage pools tokens into tiles and identifies high-weight critical tokens; a fine stage computes token-level attention only inside those tiles, subject to a block computing layout that ensures hardware efficiency. This leads to a single differentiable kernel that trains end-to-end, requires no post-hoc profiling, and sustains 85% of FlashAttention3 MFU. We perform a large sweep of ablation studies and scaling-law experiments by pretraining DiTs from 60M to 1.4B parameters. VSA reaches a Pareto point that cuts training FLOPS by 2.53…
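The coarse-to-fine idea in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' kernel: it assumes a 1D token sequence rather than 3D video tiles, mean pooling for the coarse stage, and a fixed top-k selection of key tiles per query tile. The names `coarse_to_fine_attention`, `tile`, and `top_k` are hypothetical, and any further details of VSA are not modeled here.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def coarse_to_fine_attention(q, k, v, tile=4, top_k=2):
    """Block-sparse attention sketch. q, k, v: (n, d), n divisible by tile."""
    n, d = q.shape
    n_tiles = n // tile
    scale = 1.0 / np.sqrt(d)

    # Coarse stage (assumed mean pooling): one representative per tile.
    q_coarse = q.reshape(n_tiles, tile, d).mean(axis=1)
    k_coarse = k.reshape(n_tiles, tile, d).mean(axis=1)
    coarse_scores = (q_coarse @ k_coarse.T) * scale  # (n_tiles, n_tiles)

    # Assumed selection rule: keep the top_k key tiles per query tile.
    keep = np.argsort(-coarse_scores, axis=-1)[:, :top_k]

    # Fine stage: token-level attention restricted to the kept tiles.
    out = np.empty_like(v)
    for i in range(n_tiles):
        rows = slice(i * tile, (i + 1) * tile)
        cols = np.concatenate(
            [np.arange(j * tile, (j + 1) * tile) for j in keep[i]]
        )
        scores = (q[rows] @ k[cols].T) * scale
        out[rows] = softmax(scores) @ v[cols]
    return out


rng = np.random.default_rng(0)
q, k, v = rng.standard_normal((3, 16, 8))
sparse_out = coarse_to_fine_attention(q, k, v, tile=4, top_k=2)
print(sparse_out.shape)
```

With `top_k` equal to the number of tiles, every key is kept and the sketch reduces to ordinary dense attention, which gives a simple correctness check. A real implementation would fuse both stages into a block-sparse GPU kernel instead of looping in Python.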
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Computer Graphics and Visualization Techniques · Advanced Neuroimaging Techniques and Applications
Methods: Softmax · Attention Is All You Need · Diffusion
