Sparse Forcing: Native Trainable Sparse Attention for Real-time Autoregressive Diffusion Video Generation
Boxun Xu, Yuming Du, Zichang Liu, Siyu Yang, Ziyang Jiang, Siqi Yan, Rajasi Saha, Albert Pumarola, Wenchen Wang, Peng Li

TL;DR
Sparse Forcing introduces a trainable sparse attention mechanism for autoregressive video diffusion models, enhancing long-horizon generation quality and reducing decoding latency through efficient GPU kernels and persistent spatiotemporal memory.
Contribution
The paper proposes a trainable native sparsity mechanism and Persistent Block-Sparse Attention (PBSA), an efficient GPU kernel, for scalable, low-latency autoregressive video generation.
Findings
Improves VBench score by +0.26 over Self-Forcing on 5-second videos.
Achieves 1.11-1.17x decoding speedup and 42% lower KV-cache footprint.
Improves long-horizon video quality, with VBench gains of up to +2.74 on 1-minute generations.
Abstract
We introduce Sparse Forcing, a training-and-inference paradigm for autoregressive video diffusion models that improves long-horizon generation quality while reducing decoding latency. Sparse Forcing is motivated by an empirical observation in autoregressive diffusion rollouts: attention concentrates on a persistent subset of salient visual blocks, forming an implicit spatiotemporal memory in the KV cache, and exhibits a locally structured block-sparse pattern within sliding windows. Building on this observation, we propose a trainable native sparsity mechanism that learns to compress, preserve, and update these persistent blocks while restricting computation within each local window to a dynamically selected local neighborhood. To make the approach practical at scale for both training and inference, we further propose Persistent Block-Sparse Attention (PBSA), an efficient GPU kernel…
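The block-selection idea described in the abstract can be illustrated with a small sketch. This is not the authors' PBSA kernel or their learned selection rule, neither of which is specified here. It is a plain-Python toy that assumes each KV block already has a saliency score (the `scores` list is hypothetical), keeps a few high-scoring "persistent" blocks from outside the sliding window, picks a top-k neighborhood inside the window, and then runs softmax attention over only those blocks. The function names and parameters (`select_blocks`, `block_sparse_attention`, `window`, `top_k_local`, `num_persistent`) are all illustrative assumptions.

```python
import math


def select_blocks(scores, query_block, window, top_k_local, num_persistent):
    """Return sorted KV-block indices that one query block attends to.

    scores[i] is a hypothetical saliency score for KV block i (for example,
    pooled attention mass). The paper's actual selection rule is learned and
    is not reproduced here.
    """
    # Causal view: only blocks 0..query_block are visible.
    local_start = max(0, query_block - window + 1)
    local = range(local_start, query_block + 1)
    older = range(0, local_start)

    # Dynamically selected neighborhood inside the sliding window.
    local_pick = sorted(local, key=lambda i: scores[i], reverse=True)[:top_k_local]
    # Persistent "memory" blocks retained from outside the window.
    persistent_pick = sorted(older, key=lambda i: scores[i], reverse=True)[:num_persistent]
    return sorted(set(local_pick) | set(persistent_pick))


def block_sparse_attention(q, keys, values, block_size, selected):
    """Softmax attention for one query vector over the selected KV blocks only."""
    # Expand block indices into token indices.
    idx = [t for b in selected
           for t in range(b * block_size, min((b + 1) * block_size, len(keys)))]
    scale = 1.0 / math.sqrt(len(q))
    logits = [scale * sum(qi * ki for qi, ki in zip(q, keys[t])) for t in idx]
    m = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(l - m) for l in logits]
    z = sum(weights)
    dim = len(values[0])
    return [sum(w * values[t][d] for w, t in zip(weights, idx)) / z
            for d in range(dim)]


# Toy usage: 8 KV blocks of 2 tokens each, query sits in the last block.
scores = [0.9, 0.1, 0.8, 0.2, 0.3, 0.7, 0.4, 0.6]
selected = select_blocks(scores, query_block=7, window=4,
                         top_k_local=2, num_persistent=2)
keys = [[0.0, 0.0] for _ in range(16)]
values = [[float(t)] for t in range(16)]
out = block_sparse_attention([1.0, 0.0], keys, values, block_size=2,
                             selected=selected)
```

In this toy, skipping the unselected blocks shrinks both the compute and the set of cached blocks that must be kept, which is the intuition behind the latency and KV-cache savings reported above. A real implementation would fuse selection and attention in a GPU kernel rather than loop in Python.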
