VSA: Faster Video Diffusion with Trainable Sparse Attention

Peiyuan Zhang; Yongqi Chen; Haofeng Huang; Will Lin; Zhengzhong Liu; Ion Stoica; Eric Xing; Hao Zhang

arXiv:2505.13389·cs.CV·October 29, 2025

Open Access · 1 Repo · 4 Models · 2 Datasets

TL;DR

VSA introduces a trainable sparse attention mechanism for video diffusion transformers, significantly reducing computational costs while maintaining performance, enabling faster training and inference for large-scale models.

Contribution

The paper presents VSA, a novel trainable sparse attention method that replaces full attention in video diffusion transformers, improving efficiency without sacrificing accuracy.

Findings

01 Reduces training FLOPs by 2.53× with no loss in diffusion quality.

02 Speeds up attention computation by 6× on open-source models.

03 Cuts end-to-end generation time from 31 s to 18 s with comparable quality.
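
As a back-of-envelope check on how the last two findings relate, the sketch below applies Amdahl's law. It assumes, which this page does not state, that both numbers come from the same model and setting and that attention is the only component that gets faster. The function name `implied_attention_share` is hypothetical and used only for illustration.

```python
def implied_attention_share(t_before: float, t_after: float, attn_speedup: float) -> float:
    """Fraction f of baseline time spent in attention, assuming Amdahl's law:
    t_after / t_before = (1 - f) + f / attn_speedup.
    Assumption: everything outside attention runs at its original speed."""
    ratio = t_after / t_before
    return (1.0 - ratio) / (1.0 - 1.0 / attn_speedup)


# Figures quoted in the findings above: 31 s -> 18 s end to end, 6x on attention.
end_to_end_speedup = 31.0 / 18.0
share = implied_attention_share(31.0, 18.0, 6.0)

print(f"end-to-end speedup: {end_to_end_speedup:.2f}x")
print(f"implied attention share of baseline time: {share:.0%}")
```

Under these assumptions the end-to-end speedup is about 1.72×, and attention would account for roughly half of the baseline generation time. This is an inference from the two quoted figures, not a number reported on this page.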

Abstract

Scaling video diffusion transformers (DiTs) is limited by their quadratic 3D attention, even though most of the attention mass concentrates on a small subset of positions. We turn this observation into VSA, a trainable, hardware-efficient sparse attention that replaces full attention at both training and inference. In VSA, a lightweight coarse stage pools tokens into tiles and identifies high-weight critical tokens; a fine stage computes token-level attention only inside those tiles, subject to a block computing layout that ensures hardware efficiency. This leads to a single differentiable kernel that trains end-to-end, requires no post-hoc profiling, and sustains 85% of FlashAttention3 MFU. We perform a large sweep of ablation studies and scaling-law experiments by pretraining DiTs from 60M to 1.4B parameters. VSA reaches a Pareto point that cuts training FLOPs by 2.53× …
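
The coarse-to-fine idea in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' kernel. It assumes a flat 1D token sequence rather than 3D video tiles, mean pooling for the coarse stage, and a fixed top-k rule for choosing key tiles. The abstract specifies none of these, and the names `coarse_to_fine_attention`, `tile`, and `top_k` are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def coarse_to_fine_attention(q, k, v, tile=4, top_k=2):
    """Two-stage sparse attention sketch for q, k, v of shape (n, d).

    Assumptions (not from the source): 1D tiles of `tile` consecutive tokens,
    mean pooling, and top-k key-tile selection per query tile.
    """
    n, d = q.shape
    assert n % tile == 0, "sequence length must be a multiple of the tile size"
    t = n // tile
    scale = 1.0 / np.sqrt(d)

    # Coarse stage: pool each tile to one token and score tile pairs.
    q_c = q.reshape(t, tile, d).mean(axis=1)
    k_c = k.reshape(t, tile, d).mean(axis=1)
    coarse = softmax(q_c @ k_c.T * scale)  # (t, t) tile-level attention mass

    # Keep the top_k highest-weight key tiles for every query tile.
    selected = np.argsort(-coarse, axis=-1)[:, :top_k]  # (t, top_k)

    # Fine stage: token-level attention restricted to the selected tiles.
    out = np.zeros((n, v.shape[1]))
    for i in range(t):
        rows = slice(i * tile, (i + 1) * tile)
        cols = np.concatenate(
            [np.arange(j * tile, (j + 1) * tile) for j in selected[i]]
        )
        weights = softmax(q[rows] @ k[cols].T * scale)
        out[rows] = weights @ v[cols]
    return out, selected


rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((16, 8)) for _ in range(3))
sparse_out, picked = coarse_to_fine_attention(q, k, v, tile=4, top_k=2)
print(sparse_out.shape, picked.shape)
```

With `top_k` equal to the number of tiles, the fine stage sees every key and the result matches dense attention. Smaller values reduce the fine-stage cost in proportion to the fraction of tiles kept. The Python loop is for readability only; the efficiency the abstract describes comes from a fused, block-structured GPU kernel, which this sketch does not attempt to reproduce.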

Code & Models

Repositories

hao-ai-lab/fastvideo (PyTorch)

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Computer Graphics and Visualization Techniques · Advanced Neuroimaging Techniques and Applications

Methods: Softmax · Attention Is All You Need · Diffusion