USV: Unified Sparsification for Accelerating Video Diffusion Models

Xinjian Wu; Hongmei Wang; Yuan Zhou; Qinglin Lu

arXiv:2512.05754 · cs.CV · December 8, 2025

TL;DR

USV introduces a unified, trainable sparsification framework that jointly optimizes attention pruning, token merging, and the number of denoising steps, accelerating video diffusion models while preserving visual quality.

Contribution

The paper presents what it describes as the first end-to-end trainable system that co-optimizes multiple sparsification strategies for video diffusion models, with reported speedups of up to 83.3% in the denoising process and 22.7% end to end.

Findings

01. Up to 83.3% speedup in the denoising process
02. 22.7% end-to-end acceleration
03. Maintains high visual fidelity

Abstract

The scalability of high-fidelity video diffusion models (VDMs) is constrained by two key sources of redundancy: the quadratic complexity of global spatio-temporal attention and the computational overhead of long iterative denoising trajectories. Existing accelerators -- such as sparse attention and step-distilled samplers -- typically target a single dimension in isolation and quickly encounter diminishing returns, as the remaining bottlenecks become dominant. In this work, we introduce USV (Unified Sparsification for Video diffusion models), an end-to-end trainable framework that overcomes this limitation by jointly orchestrating sparsification across both the model's internal computation and its sampling process. USV learns a dynamic, data- and timestep-dependent sparsification policy that prunes redundant attention connections, adaptively merges semantically similar tokens, and…
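The two model-side ideas named in the abstract, pruning attention connections and merging similar tokens, can be illustrated with a small sketch. This is not USV's method: USV learns a data- and timestep-dependent policy, whereas the sketch below uses fixed heuristics (a top-k rule and a cosine-similarity threshold) chosen only for illustration. The function names `topk_sparse_attention` and `merge_similar_tokens`, and the parameters `keep` and `threshold`, are hypothetical and do not come from the paper.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    norm = math.sqrt(dot(a, a)) * math.sqrt(dot(b, b))
    return dot(a, b) / norm if norm > 0 else 0.0


def topk_sparse_attention(queries, keys, values, keep):
    """Attend each query to only its `keep` highest-scoring keys.

    Illustrative fixed rule (assumes 1 <= keep <= len(keys)); the paper
    learns which connections to prune instead.
    """
    scale = 1.0 / math.sqrt(len(keys[0]))
    outputs = []
    for q in queries:
        scores = [dot(q, k) * scale for k in keys]
        # Indices of the `keep` largest scores; all other connections are pruned.
        kept = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)[:keep]
        # Softmax over the kept scores only (max-subtracted for stability).
        top = max(scores[i] for i in kept)
        weights = [math.exp(scores[i] - top) for i in kept]
        total = sum(weights)
        out = [0.0] * len(values[0])
        for w, i in zip(weights, kept):
            for d, x in enumerate(values[i]):
                out[d] += (w / total) * x
        outputs.append(out)
    return outputs


def merge_similar_tokens(tokens, threshold):
    """Greedily fold each token into the previous group when the cosine
    similarity between the token and that group's mean exceeds `threshold`.

    Illustrative heuristic only, not the paper's adaptive merging.
    """
    groups = []  # each entry: (sum of member vectors, member count)
    for tok in tokens:
        if groups:
            acc, count = groups[-1]
            mean = [x / count for x in acc]
            if cosine(mean, tok) > threshold:
                groups[-1] = ([a + t for a, t in zip(acc, tok)], count + 1)
                continue
        groups.append((list(tok), 1))
    return [[x / count for x in acc] for acc, count in groups]


# Usage: four tokens forming two near-duplicate pairs.
tokens = [[1.0, 0.0], [0.99, 0.01], [0.0, 1.0], [0.01, 0.99]]
merged = merge_similar_tokens(tokens, threshold=0.95)
attended = topk_sparse_attention(merged, merged, merged, keep=1)
```

In this toy input the four tokens collapse to two, so attention runs over a 2×2 score matrix rather than 4×4, and `keep=1` further restricts each query to a single key. The sketch only shows why the two reductions compound; how the two are balanced against the sampling schedule is what the paper's learned policy addresses.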

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Image and Video Quality Assessment · Image Enhancement Techniques