Head-Aware Key-Value Compression for Efficient Autoregressive Image Generation
Guotao Liang, Baoquan Zhang, Zhiyuan Wen, Yunming Ye

TL;DR
This paper introduces HeadKV, a head-aware key-value cache compression framework for autoregressive image generation that allocates memory based on attention head types, improving efficiency without extra training.
Contribution
It proposes a novel method to identify attention head types early and allocate cache resources accordingly, enhancing memory efficiency and generation quality.
Findings
HeadKV outperforms fixed-budget methods in memory efficiency.
The approach generalizes across different models and inputs.
Stratified Token Eviction preserves long-range information effectively.
Abstract
Autoregressive (AR) visual generation has achieved remarkable performance but suffers from high memory usage and low throughput, as it requires caching previously generated visual tokens. Recent research has shown that retaining only a few lines of cache tokens can maintain high-quality images while significantly reducing memory usage and improving throughput. However, these methods allocate a fixed budget to each attention head, overlooking the heterogeneity among attention heads, leading to suboptimal memory allocation. In this paper, we observe that attention heads across different layers exhibit diverse attention patterns, where some heads focus on local neighborhoods while others capture broader contextual dependencies. Based on this insight, we propose a novel head-aware key-value (KV) cache compression framework for autoregressive image generation, called HeadKV, which assigns…
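The idea of giving different heads different cache budgets can be illustrated with a small sketch. This is not the paper's algorithm: the locality score, the threshold, the window size, and the function names `locality_scores` and `allocate_budgets` are all hypothetical choices made for illustration. It only assumes, as the abstract states, that some heads attend mostly to a local neighborhood while others need broader context.

```python
import numpy as np

def locality_scores(attn, window):
    """attn: array [heads, keys], one query's attention over cached keys
    (each row sums to 1). Returns, per head, the fraction of attention
    mass that falls on the most recent `window` keys.
    Assumption: 'local' is approximated here by recency in the token order."""
    return attn[:, -window:].sum(axis=1)

def allocate_budgets(scores, total_budget, local_budget, threshold=0.8):
    """Hypothetical allocation rule: heads whose locality score reaches
    `threshold` are treated as local and keep only `local_budget` cache
    slots; the slots left over are split evenly among the other heads."""
    scores = np.asarray(scores)
    is_local = scores >= threshold
    budgets = np.full(scores.shape, local_budget, dtype=int)
    n_global = int((~is_local).sum())
    if n_global > 0:
        remaining = total_budget - local_budget * int(is_local.sum())
        budgets[~is_local] = remaining // n_global
    return is_local, budgets

# Toy example: head 0 attends only to the last two keys, head 1 is uniform.
attn = np.array([
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.5, 0.5],
    [0.125] * 8,
])
scores = locality_scores(attn, window=2)
is_local, budgets = allocate_budgets(scores, total_budget=16, local_budget=2)
print(is_local, budgets)  # budgets -> [2, 14]
```

A fixed-budget scheme would give both heads 8 slots here; the sketch instead moves the slots the local head does not need to the head with broader attention, keeping the total at 16.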
