KV Cache Quantization for Self-Forcing Video Generation: A 33-Method Empirical Study

Suraj Ranganath, Vaishak Menon, and Anish Patnaik

arXiv:2603.27469 · cs.LG · March 31, 2026

TL;DR

This paper empirically evaluates 33 quantization and cache policies for KV-cache compression in self-forcing video generation, aiming to improve memory efficiency without sacrificing quality.

Contribution

It provides a comprehensive empirical analysis of KV-cache compression methods and identifies practical options, such as FlowCache-inspired soft-prune INT4, that strike a better balance between memory use and performance.

Findings

1. FlowCache-inspired soft-prune INT4 achieves 5.42-5.49x compression with reduced VRAM.
2. High-fidelity methods such as PRQ_INT4 and QUAROT_KV_INT4 are costly in runtime or memory.
3. Methods that nominally compress the KV cache can still exceed the baseline's peak VRAM under current integration practices.
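Realized compression ratios depend on how quantization metadata is counted. As a hedged illustration, the sketch below uses a generic asymmetric group-wise INT4 scheme; it is not PRQ, QuaRot, or the paper's FlowCache-inspired method. The group size and the assumption that each group stores a BF16 scale and a BF16 zero point are illustrative choices, and the function names are hypothetical.

```python
import numpy as np


def quantize_int4_groupwise(x: np.ndarray, group_size: int = 64):
    """Asymmetric per-group INT4 quantization (a generic scheme, not the paper's).

    Each contiguous group of `group_size` values gets its own scale and
    zero point (the group minimum). Codes are integers in [0, 15]; they are
    held one per uint8 here for clarity, whereas a real kernel would pack
    two 4-bit codes into each byte.
    """
    if x.size % group_size != 0:
        raise ValueError("tensor size must be a multiple of group_size")
    flat = x.astype(np.float32).reshape(-1, group_size)
    lo = flat.min(axis=1, keepdims=True)
    hi = flat.max(axis=1, keepdims=True)
    scale = (hi - lo) / 15.0
    scale = np.where(scale == 0, 1.0, scale)  # constant groups: avoid division by zero
    codes = np.clip(np.round((flat - lo) / scale), 0, 15).astype(np.uint8)
    return codes, scale, lo


def dequantize_int4_groupwise(codes, scale, lo, shape):
    """Map codes back to float32 using the stored per-group scale and zero point."""
    return (codes.astype(np.float32) * scale + lo).reshape(shape)


def realized_ratio(group_size: int = 64, base_bits: int = 16,
                   code_bits: int = 4, meta_bits_per_group: int = 32) -> float:
    """Storage ratio vs. BF16 once per-group metadata is counted.

    Assumes codes are bit-packed and each group stores a BF16 scale and a
    BF16 zero point (2 x 16 = 32 metadata bits per group).
    """
    bits_per_elem = code_bits + meta_bits_per_group / group_size
    return base_bits / bits_per_elem


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    keys = rng.standard_normal((2, 8, 64)).astype(np.float32)  # toy key tensor
    codes, scale, lo = quantize_int4_groupwise(keys)
    recon = dequantize_int4_groupwise(codes, scale, lo, keys.shape)
    print("max abs error:", float(np.abs(keys - recon).max()))
    for g in (32, 64, 128):
        print(f"group_size={g}: {realized_ratio(g):.3f}x vs BF16")
```

With a BF16 scale and zero point per group of 64, each value costs 4.5 bits, so plain INT4 reaches about 3.56x rather than the nominal 4x. If the reported 5.42-5.49x figures are measured against a BF16 cache, the excess over 4x would have to come from the soft-pruning component rather than from 4-bit storage alone.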

Abstract

Self-forcing video generation extends a short-horizon video model to longer rollouts by repeatedly feeding generated content back in as context. This scaling path immediately exposes a systems bottleneck: the key-value (KV) cache grows with rollout length, so longer videos require not only better generation quality but also substantially better memory behavior. We present a comprehensive empirical study of KV-cache compression for self-forcing video generation on a Wan2.1-based Self-Forcing stack. Our study covers 33 quantization and cache-policy variants, 610 prompt-level observations, and 63 benchmark-level summaries across two evaluation settings: MovieGen for single-shot 10-second generation and StoryEval for longer narrative-style stability. We jointly evaluate peak VRAM, runtime, realized compression ratio, VBench imaging quality, BF16-referenced fidelity (SSIM, LPIPS, PSNR), and…
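The abstract's central systems point, that the KV cache grows with rollout length, follows from simple arithmetic. The sketch below assumes a standard multi-head attention cache with no eviction window and uses hypothetical layer, head, and per-frame token counts chosen only for illustration (not the paper's Wan2.1-based configuration) to estimate cache size for BF16 and for an idealized INT4 cache.

```python
def kv_cache_bytes(num_layers: int, num_heads: int, head_dim: int,
                   cached_tokens: int, bytes_per_elem: float = 2.0) -> float:
    """Bytes needed to hold keys and values for `cached_tokens` tokens.

    The leading factor 2 counts one key and one value vector per token,
    per attention head, per transformer layer. bytes_per_elem=2.0 is BF16;
    0.5 approximates packed INT4 with quantization metadata ignored.
    """
    return 2 * num_layers * num_heads * head_dim * cached_tokens * bytes_per_elem


# Hypothetical model/shape parameters chosen only for illustration; they are
# not taken from the paper or from its Wan2.1-based configuration.
LAYERS, HEADS, HEAD_DIM = 30, 12, 128
TOKENS_PER_FRAME = 1560  # hypothetical tokens per latent frame

if __name__ == "__main__":
    # Assumes every generated latent frame stays in the cache (no eviction),
    # so memory grows linearly with rollout length.
    for frames in (21, 42, 84):
        tokens = frames * TOKENS_PER_FRAME
        bf16 = kv_cache_bytes(LAYERS, HEADS, HEAD_DIM, tokens, 2.0)
        int4 = kv_cache_bytes(LAYERS, HEADS, HEAD_DIM, tokens, 0.5)
        print(f"{frames:3d} frames: BF16 {bf16 / 2**30:6.2f} GiB, "
              f"INT4 ~{int4 / 2**30:6.2f} GiB")
```

The linear dependence on `cached_tokens` is the point: doubling the rollout length doubles the cache unless frames are evicted, pruned, or quantized, which is the design space the 33 quantization and cache-policy variants explore.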

Code & Models

GitHub repository: suraj-ranganath/kv-quant-longhorizon