Sparse Forcing: Native Trainable Sparse Attention for Real-time Autoregressive Diffusion Video Generation

Boxun Xu; Yuming Du; Zichang Liu; Siyu Yang; Ziyang Jiang; Siqi Yan; Rajasi Saha; Albert Pumarola; Wenchen Wang; Peng Li

arXiv:2604.21221·cs.CV·April 24, 2026

TL;DR

Sparse Forcing introduces a trainable sparse attention mechanism for autoregressive video diffusion models, enhancing long-horizon generation quality and reducing decoding latency through efficient GPU kernels and persistent spatiotemporal memory.

Contribution

The paper proposes a trainable sparsity mechanism and an efficient GPU kernel, Persistent Block-Sparse Attention (PBSA), for scalable, low-latency autoregressive video generation.

Findings

01

Improves VBench score by +0.26 over Self-Forcing on 5-second videos.

02

Achieves 1.11-1.17x decoding speedup and 42% lower KV-cache footprint.

03

Improves long-horizon video quality by up to +2.74 VBench on 1-minute generations.

Abstract

We introduce Sparse Forcing, a training-and-inference paradigm for autoregressive video diffusion models that improves long-horizon generation quality while reducing decoding latency. Sparse Forcing is motivated by an empirical observation in autoregressive diffusion rollouts: attention concentrates on a persistent subset of salient visual blocks, forming an implicit spatiotemporal memory in the KV cache, and exhibits a locally structured block-sparse pattern within sliding windows. Building on this observation, we propose a trainable native sparsity mechanism that learns to compress, preserve, and update these persistent blocks while restricting computation within each local window to a dynamically selected local neighborhood. To make the approach practical at scale for both training and inference, we further propose Persistent Block-Sparse Attention (PBSA), an efficient GPU kernel…
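
To make the attention pattern described in the abstract more concrete, the following is a minimal sketch of a block-level mask that combines a set of persistent blocks with a top-k selection inside a causal sliding window. It is an illustration only, not the paper's PBSA kernel or its training procedure. The function names `build_block_mask` and `density`, the `scores` input (a hypothetical block-level relevance score), and the parameters `persistent`, `window`, and `top_k` are all assumptions introduced for this example.

```python
from typing import List, Sequence, Set


def build_block_mask(
    scores: Sequence[Sequence[float]],
    persistent: Set[int],
    window: int,
    top_k: int,
) -> List[List[bool]]:
    """mask[q][k] is True when query block q may attend to key/value block k.

    scores[q][k] is a hypothetical block-level relevance score (for example,
    a pooled query-key similarity); it is an assumption of this sketch.
    """
    n = len(scores)
    mask = [[False] * n for _ in range(n)]
    for q in range(n):
        # Persistent blocks act as long-range memory (causal: past blocks only).
        for k in persistent:
            if k <= q:
                mask[q][k] = True
        # Local sliding window over the most recent `window` blocks.
        lo = max(0, q - window + 1)
        local = range(lo, q + 1)
        # Keep the top_k highest-scoring local blocks for this query block.
        ranked = sorted(local, key=lambda k: scores[q][k], reverse=True)
        for k in ranked[:top_k]:
            mask[q][k] = True
        # A block always attends to itself.
        mask[q][q] = True
    return mask


def density(mask: List[List[bool]]) -> float:
    """Fraction of causal (k <= q) block pairs that would be computed."""
    n = len(mask)
    kept = sum(mask[q][k] for q in range(n) for k in range(q + 1))
    return kept / (n * (n + 1) // 2)


# Toy usage: 8 blocks, scores decay with distance, block 0 kept as memory.
n = 8
scores = [[1.0 / (1 + abs(q - k)) for k in range(n)] for q in range(n)]
mask = build_block_mask(scores, persistent={0}, window=4, top_k=2)
for row in mask:
    print("".join("x" if kept else "." for kept in row))
print("causal density:", density(mask))
```

In this toy setting the persistent set is fixed by hand. The abstract describes a mechanism that learns to compress, preserve, and update the persistent blocks, which this sketch does not attempt to model.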
