HeatKV: Head-tuned KV-cache Compression for Visual Autoregressive Modeling
Jonathan Cederlund, Axel Berg, Durmus Alp Emre Acar, Chuteng Zhou, Pontus Giselsson

TL;DR
HeatKV introduces a head-tuned KV-cache compression method for visual autoregressive models, significantly reducing memory usage while maintaining high image quality and prompt alignment.
Contribution
The paper presents a novel static pruning schedule based on attention head importance, achieving state-of-the-art KV-cache compression in VAR models.
Findings
HeatKV doubles the compression ratio compared to existing methods.
Maintains similar or better image fidelity and human perception scores.
Achieves state-of-the-art KV-cache compression for VAR models.
Abstract
Visual Autoregressive (VAR) models have recently demonstrated impressive image generation quality while maintaining low latency. However, they suffer from severe KV-cache memory constraints, often requiring gigabytes of memory per generated image. We introduce HeatKV, a novel compression method that adapts the cache allocation of each head based on its attention to previously generated scales. Using a small offline calibration set, the attention heads are ranked according to their attention scores over prior scales. Based on this ranking, we construct a static pruning schedule tailored to a given memory budget. Applied to the Infinity-2B model, HeatKV achieves a higher KV-cache compression ratio than existing methods, while maintaining similar or better image fidelity, prompt alignment, and human perception scores. Our method achieves a new…
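The abstract describes a two-step pipeline: rank heads by their attention to prior scales on a calibration set, then derive a static per-head schedule for a fixed memory budget. The sketch below illustrates that general shape only. The abstract does not specify the scoring or allocation rule, so the mean-score ranking, the proportional split, and the names `rank_heads`, `static_schedule`, and `min_per_head` are all assumptions made for illustration, not the authors' method.

```python
from typing import Dict, List


def rank_heads(prior_scale_attention: Dict[str, List[float]]) -> List[str]:
    """Rank heads by mean attention mass on prior scales, highest first.

    Input maps a head name to one score per calibration sample
    (hypothetical format; ties are broken by head name).
    """
    mean = {h: sum(v) / len(v) for h, v in prior_scale_attention.items()}
    return sorted(mean, key=lambda h: (-mean[h], h))


def static_schedule(
    prior_scale_attention: Dict[str, List[float]],
    total_budget: int,
    min_per_head: int = 1,
) -> Dict[str, int]:
    """Split `total_budget` cache entries across heads.

    Assumed rule: every head gets `min_per_head` entries, and the rest
    is shared in proportion to the head's mean calibration score.
    """
    heads = rank_heads(prior_scale_attention)
    n = len(heads)
    if total_budget < n * min_per_head:
        raise ValueError("budget too small for the per-head minimum")

    mean = {h: sum(v) / len(v) for h, v in prior_scale_attention.items()}
    total_score = sum(mean.values())
    spare = total_budget - n * min_per_head

    alloc = {}
    for h in heads:
        # Fall back to an equal split if no head attends to prior scales.
        share = mean[h] / total_score if total_score > 0 else 1.0 / n
        alloc[h] = min_per_head + int(spare * share)

    # Hand out entries lost to rounding, most important heads first.
    leftover = total_budget - sum(alloc.values())
    for i in range(max(leftover, 0)):
        alloc[heads[i % n]] += 1
    return alloc


# Toy calibration scores for three heads (made-up values).
scores = {"h0": [0.5, 0.5], "h1": [0.25, 0.25], "h2": [0.25, 0.25]}
schedule = static_schedule(scores, total_budget=12)
```

Because the schedule is computed once offline, applying it at generation time needs no per-image attention statistics; each head simply keeps the number of cache entries assigned to it.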
