HeatKV: Head-tuned KV-cache Compression for Visual Autoregressive Modeling

Jonathan Cederlund; Axel Berg; Durmus Alp Emre Acar; Chuteng Zhou; Pontus Giselsson

arXiv:2605.14877·cs.CV·May 15, 2026

TL;DR

HeatKV is a head-tuned KV-cache compression method for visual autoregressive (VAR) models. Applied to Infinity-2B, it achieves a 2× higher KV-cache compression ratio than existing methods while maintaining similar or better image quality and prompt alignment.

Contribution

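The paper presents a static pruning schedule built from an offline ranking of attention heads by their attention to previously generated scales. The schedule is tailored to a given memory budget and achieves state-of-the-art KV-cache compression in VAR models.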

Findings

01. On Infinity-2B, HeatKV achieves a 2× higher KV-cache compression ratio than existing methods.

02. Image fidelity, prompt alignment, and human perception scores remain similar to or better than those of existing methods.

03. HeatKV achieves state-of-the-art KV-cache compression for VAR models.

Abstract

Visual Autoregressive (VAR) models have recently demonstrated impressive image generation quality while maintaining low latency. However, they suffer from severe KV-cache memory constraints, often requiring gigabytes of memory per generated image. We introduce HeatKV, a novel compression method that adapts cache allocation in each head based on its attention to previously generated scales. Using a small offline calibration set, the attention heads are ranked according to their attention scores over prior scales. Based on this ranking, we construct a static pruning schedule tailored to a given memory budget. Applied to the Infinity-2B model, HeatKV achieves $2 \times$ higher compression ratio in memory allocation for KV cache compared to existing methods, while maintaining similar or better image fidelity, prompt alignment and human perception score. Our method achieves a new…
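
The two steps the abstract describes (ranking heads from calibration data, then deriving a static per-head schedule for a memory budget) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not give the scoring formula or the allocation rule, so the score used here (mean fraction of attention mass on prior-scale tokens) and the proportional allocation are assumptions, and the names `rank_heads` and `static_schedule` are hypothetical.

```python
import numpy as np


def rank_heads(prior_scale_attention):
    """Rank heads by their mean attention to previously generated scales.

    prior_scale_attention: array of shape (num_samples, num_heads), where each
    entry is the fraction of a head's attention mass that falls on tokens from
    earlier scales for one calibration sample (assumed scoring rule).
    Returns (scores, order), with order sorted from most to least important.
    """
    scores = np.asarray(prior_scale_attention, dtype=float).mean(axis=0)
    order = np.argsort(-scores, kind="stable")
    return scores, order


def static_schedule(scores, full_cache_len, budget_fraction, min_keep=1):
    """Split a total KV-cache budget across heads in proportion to their scores.

    Proportional allocation is an illustrative assumption, not the paper's rule.
    Returns the number of cached tokens each head is allowed to keep.
    """
    num_heads = scores.shape[0]
    total_budget = int(budget_fraction * full_cache_len * num_heads)
    weights = scores / scores.sum()
    keep = np.floor(weights * total_budget).astype(int)
    # No head keeps more than the full cache or fewer than min_keep tokens.
    return np.clip(keep, min_keep, full_cache_len)


# Toy calibration data: 2 samples, 4 heads (made-up numbers for illustration).
calibration = np.array([
    [0.8, 0.2, 0.5, 0.1],
    [0.6, 0.2, 0.3, 0.1],
])
scores, order = rank_heads(calibration)
keep = static_schedule(scores, full_cache_len=1000, budget_fraction=0.25)
print("head ranking:", order.tolist())
print("tokens kept per head:", keep.tolist())
```

Because the schedule is computed once offline, inference only needs to look up each head's allowance; no attention statistics are gathered at generation time. In this sketch, heads that attend more to earlier scales receive a larger share of the budget.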
