TL;DR
This paper introduces Forcing-KV, a hybrid cache compression method for autoregressive video diffusion models that reduces memory usage and accelerates generation without sacrificing quality.
Contribution
It presents a novel head-wise functional analysis and a hybrid pruning strategy to optimize KV cache compression in AR video diffusion models.
Findings
Achieves over 29 fps generation speed on a single GPU.
Reduces cache memory by 30%.
Provides up to 2.82x speedup at higher resolutions.
Abstract
Autoregressive (AR) video diffusion models adopt a streaming generation framework, enabling long-horizon video generation with real-time responsiveness, as exemplified by the Self Forcing training paradigm. However, existing AR video diffusion models still suffer from significant attention complexity and severe memory overhead due to the redundant key-value (KV) caches across historical frames, which limits scalability. In this paper, we tackle this challenge by introducing KV cache compression into autoregressive video diffusion. We observe that attention heads in mainstream AR diffusion models exhibit markedly distinct attention patterns and functional roles that remain stable across samples and denoising steps. Building on our empirical study of head-wise functional specialization, we divide the attention heads into two categories: static heads, which focus on transitions across…
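As a rough illustration of what head-wise KV cache compression can look like in general, the sketch below keeps a different number of cached frames for each attention head, chosen by a per-head frame importance score. This is not the Forcing-KV algorithm, whose details are not given in the text above. The function names (`prune_kv_per_head`, `cache_fraction`), the cache layout `(heads, frames, tokens, dim)`, the scoring input, and all budgets are assumptions made for the example.

```python
import numpy as np


def prune_kv_per_head(keys, values, frame_scores, keep_frames):
    """Keep the top-scoring cached frames separately for each head.

    Hypothetical helper, not taken from the paper.
    keys, values : arrays of shape (H, F, T, D) -- heads, frames, tokens, dim
    frame_scores : array of shape (H, F), e.g. attention mass per cached frame
    keep_frames  : sequence of H ints, the frame budget for each head
    Returns a list of (keys_h, values_h, kept_frame_indices), one per head.
    """
    pruned = []
    for h in range(keys.shape[0]):
        k = int(keep_frames[h])
        # Rank frames by descending score and take the first k (k may be 0).
        order = np.argsort(frame_scores[h], kind="stable")[::-1][:k]
        # Restore temporal order so the kept frames stay chronological.
        idx = np.sort(order)
        pruned.append((keys[h, idx], values[h, idx], idx))
    return pruned


def cache_fraction(keep_frames, total_frames):
    """Fraction of the original cache that remains after pruning."""
    return sum(keep_frames) / (len(keep_frames) * total_frames)


# Toy usage with arbitrary sizes: 4 heads, 10 cached frames, 2 tokens, dim 3.
rng = np.random.default_rng(0)
H, F, T, D = 4, 10, 2, 3
keys = rng.standard_normal((H, F, T, D))
values = rng.standard_normal((H, F, T, D))
scores = rng.random((H, F))
budgets = [10, 10, 6, 4]  # arbitrary per-head budgets, for illustration only

pruned = prune_kv_per_head(keys, values, scores, budgets)
print([k.shape for k, _, _ in pruned])
print(cache_fraction(budgets, F))
```

Because each head ends up with a different number of cached frames, the result is a list of per-head arrays rather than one dense tensor. A practical implementation would need a padded or grouped layout to keep attention kernels efficient, which this sketch does not attempt.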
