USV: Unified Sparsification for Accelerating Video Diffusion Models
Xinjian Wu, Hongmei Wang, Yuan Zhou, Qinglin Lu

TL;DR
USV introduces a unified, trainable sparsification framework that jointly optimizes attention pruning, token merging, and denoising steps, significantly accelerating video diffusion models while preserving quality.
Contribution
The paper presents the first end-to-end trainable system that co-optimizes multiple sparsification strategies for video diffusion models, with reported gains of up to 83.3% in the denoising process and 22.7% end to end.
Findings
Up to 83.3% speedup in the denoising process
22.7% end-to-end acceleration
Maintains high visual fidelity
Abstract
The scalability of high-fidelity video diffusion models (VDMs) is constrained by two key sources of redundancy: the quadratic complexity of global spatio-temporal attention and the computational overhead of long iterative denoising trajectories. Existing accelerators -- such as sparse attention and step-distilled samplers -- typically target a single dimension in isolation and quickly encounter diminishing returns, as the remaining bottlenecks become dominant. In this work, we introduce USV (Unified Sparsification for Video diffusion models), an end-to-end trainable framework that overcomes this limitation by jointly orchestrating sparsification across both the model's internal computation and its sampling process. USV learns a dynamic, data- and timestep-dependent sparsification policy that prunes redundant attention connections, adaptively merges semantically similar tokens, and…
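To make two of the ideas named in the abstract concrete, here is a minimal sketch of pruning attention connections and merging similar tokens. It is not USV's method: the abstract describes a learned, data- and timestep-dependent policy, whereas this sketch assumes fixed heuristics (a top-k rule and a cosine-similarity threshold). The function names `merge_similar_tokens` and `topk_sparse_attention`, and the parameters `threshold` and `k`, are hypothetical and chosen only for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def merge_similar_tokens(tokens, threshold=0.9):
    """Greedily fold each token into the first group whose representative it resembles.

    Assumption: a fixed similarity threshold stands in for a learned merging policy.
    Returns the mean vector of each group.
    """
    groups = []  # each entry: [representative_vector, member_vectors]
    for tok in tokens:
        for group in groups:
            if cosine(tok, group[0]) >= threshold:
                group[1].append(tok)
                break
        else:
            # No existing group was similar enough, so start a new one.
            groups.append([tok, [tok]])
    merged = []
    for _, members in groups:
        dim = len(members[0])
        merged.append([sum(m[i] for m in members) / len(members) for i in range(dim)])
    return merged


def topk_sparse_attention(query, keys, values, k):
    """Attend only to the k keys with the highest dot-product score.

    Assumption: a fixed top-k rule stands in for a learned pruning policy.
    """
    scores = [sum(q * x for q, x in zip(query, key)) for key in keys]
    keep = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)[:k]
    peak = max(scores[i] for i in keep)  # subtract the max for numerical stability
    weights = [math.exp(scores[i] - peak) for i in keep]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * values[i][d] for w, i in zip(weights, keep)) / total for d in range(dim)]


tokens = [[1.0, 0.0], [0.99, 0.01], [0.0, 1.0]]
merged = merge_similar_tokens(tokens, threshold=0.9)
pruned = topk_sparse_attention([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]],
                               [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]], k=1)
```

In this toy input the first two tokens are nearly parallel and collapse into one averaged token, so later layers see two tokens instead of three. With `k=1` the query attends to a single key, so the softmax runs over one entry rather than all three. Both steps shrink the number of pairwise interactions, which is where the quadratic attention cost mentioned in the abstract comes from.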
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Image and Video Quality Assessment · Image Enhancement Techniques
