TL;DR
This paper empirically evaluates 33 quantization and cache-policy variants for KV-cache compression in self-forcing video generation, aiming to improve memory efficiency without sacrificing quality.
Contribution
It provides a comprehensive empirical analysis of KV-cache compression methods and identifies practical approaches, such as FlowCache-inspired soft-prune INT4, that better balance memory use and performance.
Findings
FlowCache-inspired soft-prune INT4 achieves 5.42-5.49x compression with reduced VRAM.
High-fidelity methods like PRQ_INT4 and QUAROT_KV_INT4 are costly in runtime or memory.
Some nominally compressed methods still fail to lower peak VRAM because of how they are currently integrated.
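To make the compression ratios and BF16-referenced fidelity metrics in these findings more concrete, here is a minimal sketch of symmetric per-group INT4 quantization applied to a stand-in KV tensor, using NumPy. It is not the paper's implementation: the helper names, the group size of 32, the FP16 scale format, and the tensor shape are all assumptions chosen for illustration, and the sketch omits pruning and cache policies entirely. It only shows how scale metadata pulls the stored-size ratio below the nominal 4x of INT4 versus 16-bit storage, and how a PSNR against the unquantized reference can be computed.

```python
import numpy as np

def quantize_int4(x, group_size=32):
    """Symmetric per-group INT4 quantization (hypothetical helper, not from the paper)."""
    groups = x.astype(np.float32).reshape(-1, group_size)
    # One scale per group, stored as FP16 (an assumption), maps the group max to code 7.
    scales = (np.abs(groups).max(axis=1, keepdims=True) / 7.0).astype(np.float16)
    scales = np.where(scales == 0, np.float16(1.0), scales)  # guard all-zero groups
    codes = np.rint(groups / scales.astype(np.float32))
    # INT4 codes live in [-8, 7]; they are held in int8 here rather than bit-packed.
    return np.clip(codes, -8, 7).astype(np.int8), scales

def dequantize_int4(codes, scales, shape):
    """Reconstruct a float32 tensor from INT4 codes and per-group scales."""
    return (codes.astype(np.float32) * scales.astype(np.float32)).reshape(shape)

def compression_ratio(num_values, group_size=32, ref_bits=16, code_bits=4, scale_bits=16):
    """Ratio of 16-bit reference size to stored size, counting scale overhead."""
    stored_bits = num_values * code_bits + (num_values // group_size) * scale_bits
    return (num_values * ref_bits) / stored_bits

def psnr(reference, test):
    """PSNR in dB, using the reference tensor's peak magnitude as the signal peak."""
    mse = float(np.mean((reference - test) ** 2))
    if mse == 0.0:
        return float("inf")
    return 10.0 * np.log10(float(np.abs(reference).max()) ** 2 / mse)

# Stand-in KV tensor with an assumed layout of (frames, heads, tokens, head_dim).
rng = np.random.default_rng(0)
kv = rng.standard_normal((2, 4, 16, 64)).astype(np.float32)

codes, scales = quantize_int4(kv)
restored = dequantize_int4(codes, scales, kv.shape)

print("stored-size ratio:", compression_ratio(kv.size))
print("PSNR vs reference (dB):", psnr(kv, restored))
```

With these assumed settings each value costs 4 bits plus a 16-bit scale shared across 32 values, so the stored-size ratio falls below 4x. Ratios above 4x, like those reported here, therefore have to come from something beyond INT4 storage alone, such as dropping part of the cache. The sketch also measures stored size only; peak VRAM during generation depends on what buffers the integration keeps alive, which this example does not model.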
Abstract
Self-forcing video generation extends a short-horizon video model to longer rollouts by repeatedly feeding generated content back in as context. This scaling path immediately exposes a systems bottleneck: the key-value (KV) cache grows with rollout length, so longer videos require not only better generation quality but also substantially better memory behavior. We present a comprehensive empirical study of KV-cache compression for self-forcing video generation on a Wan2.1-based Self-Forcing stack. Our study covers 33 quantization and cache-policy variants, 610 prompt-level observations, and 63 benchmark-level summaries across two evaluation settings: MovieGen for single-shot 10-second generation and StoryEval for longer narrative-style stability. We jointly evaluate peak VRAM, runtime, realized compression ratio, VBench imaging quality, BF16-referenced fidelity (SSIM, LPIPS, PSNR), and…
