Temporal Aware Pruning for Efficient Diffusion-based Video Generation
Sheng Li, Yang Sui, Junhao Ran, Bo Yuan, Yue Dai, Xulong Tang

TL;DR
TAPE is a training-free token-pruning method for diffusion-based video generation that keeps token selection temporally coherent across frames, speeding up generation while preserving visual quality.
Contribution
The paper proposes TAPE, which combines temporal smoothing of token importance with token reselection so that pruning preserves coherence and quality in video diffusion models.
Findings
TAPE achieves substantial speedups in video generation.
TAPE preserves high visual fidelity compared to prior methods.
TAPE outperforms existing token reduction approaches.
Abstract
Video diffusion models have recently enabled high-quality video generation with ViT-based architectures, but remain computationally intensive because generation requires attention computation over long spatiotemporal sequences. Token pruning has proven effective for ViTs and VLMs. However, most prior pruning methods are attention-based and operate per frame, failing to ensure the vital temporal coherence across frames in video generation tasks. In practice, naively adopting attention-only pruning causes noticeable degradation due to worsened background consistency, flickering, and reduced image quality. To address this, we propose TAPE, a training-free Temporal Aware Pruning for Efficient diffusion-based video generation. TAPE (i) applies temporal smoothing to align token-importance across adjacent frames and suppress selection jitter; and (ii) performs token reselection in selected…
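To make the temporal smoothing idea in step (i) concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the abstract as shown here does not specify the smoothing operator, so the neighbour-averaging rule, the `alpha` weight, and the function names `smooth_scores` and `select_tokens` are all assumptions made for illustration. The reselection step (ii) is cut off in the text above and is not sketched.

```python
from typing import List


def smooth_scores(scores: List[List[float]], alpha: float = 0.5) -> List[List[float]]:
    """Blend each frame's token importance with its adjacent frames.

    scores[t][i] is the importance of token i in frame t (hypothetical input,
    for example attention-derived). alpha weights the frame's own score; the
    remainder goes to the mean of the neighbouring frames that exist.
    This averaging rule is an assumption, not the paper's stated operator.
    """
    num_frames = len(scores)
    smoothed = []
    for t in range(num_frames):
        neighbours = [scores[n] for n in (t - 1, t + 1) if 0 <= n < num_frames]
        row = []
        for i, own in enumerate(scores[t]):
            if neighbours:
                mean_nb = sum(nb[i] for nb in neighbours) / len(neighbours)
                row.append(alpha * own + (1 - alpha) * mean_nb)
            else:
                # A single-frame clip has nothing to smooth against.
                row.append(own)
        smoothed.append(row)
    return smoothed


def select_tokens(scores: List[List[float]], keep: int) -> List[List[int]]:
    """Return, per frame, the sorted indices of the `keep` highest-scoring tokens."""
    selected = []
    for row in scores:
        # Sort by descending score; break ties by lower token index.
        ranked = sorted(range(len(row)), key=lambda i: (-row[i], i))
        selected.append(sorted(ranked[:keep]))
    return selected


# Toy scores: 3 frames, 4 tokens. Token 0 dips in the middle frame only.
scores = [
    [0.9, 0.1, 0.6, 0.3],
    [0.4, 0.1, 0.6, 0.5],
    [0.9, 0.1, 0.6, 0.3],
]

per_frame = select_tokens(scores, keep=2)
smoothed = select_tokens(smooth_scores(scores), keep=2)
print("per-frame selection:", per_frame)
print("smoothed selection: ", smoothed)
```

In this toy input, per-frame selection drops token 0 in the middle frame and picks token 3 instead, which is the kind of selection jitter the abstract describes. After smoothing, all three frames keep the same two tokens.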
