HeteroCache: A Dynamic Retrieval Approach to Heterogeneous KV Cache Compression for Long-Context LLM Inference

Zhiyuan Shi; Qibo Qiu; Feng Xue; Zhonglin Jiang; Li Yu; Jian Jiang; Xiaofei He; Wenxiao Wang

arXiv:2601.13684 · cs.CL · April 21, 2026

TL;DR

HeteroCache is a training-free, dynamic cache compression method for long-context LLM inference that categorizes attention heads and uses hierarchical storage to reduce memory and I/O overhead.

Contribution

HeteroCache introduces a fine-grained, dynamic compression framework that exploits two properties of attention heads: their temporal heterogeneity and the redundancy among heads within the same layer. It improves inference efficiency without retraining.

Findings

01 Achieves state-of-the-art performance on long-context benchmarks.

02 Accelerates decoding by up to 3 times with 224K context.

03 Effectively manages attention drift with hierarchical storage.

Abstract

The linear memory growth of the KV cache poses a significant bottleneck for LLM inference in long-context tasks. Existing static compression methods often fail to preserve globally important information. Although recent dynamic retrieval approaches attempt to address this issue, they typically suffer from coarse-grained caching strategies and incur high I/O overhead. To overcome these limitations, we propose HeteroCache, a training-free dynamic compression framework. Our method is built on two key insights: attention heads exhibit diverse temporal heterogeneity, and there is significant spatial redundancy among heads within the same layer. Guided by these insights, HeteroCache categorizes heads based on stability and similarity, applying a fine-grained weighting strategy that allocates larger cache budgets to heads with rapidly shifting attention to capture context changes. Furthermore,…
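The budget-weighting idea in the abstract can be illustrated with a minimal sketch. The abstract does not say how stability is measured or how budgets are computed, so everything here is an assumption made for illustration: the drift metric (one minus the top-k overlap between two decoding steps), the proportional allocation rule, and the names `top_k_positions`, `attention_drift`, and `allocate_budgets`. It is not the authors' implementation.

```python
def top_k_positions(scores, k):
    """Return the set of the k positions with the highest attention scores."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return set(ranked[:k])


def attention_drift(prev_attn, curr_attn, k):
    """Hypothetical per-head drift score in [0, 1].

    prev_attn and curr_attn hold one list of attention scores per head,
    taken at two decoding steps. Drift is 1 minus the fraction of top-k
    positions the two steps share, so 0 means stable and 1 means the
    head's focus moved entirely.
    """
    drifts = []
    for prev, curr in zip(prev_attn, curr_attn):
        overlap = len(top_k_positions(prev, k) & top_k_positions(curr, k))
        drifts.append(1.0 - overlap / k)
    return drifts


def allocate_budgets(drift, total_budget, min_budget):
    """Split a total cache budget across heads (assumed rule).

    Every head gets min_budget entries; the remainder is shared in
    proportion to drift, so heads with shifting attention get more.
    """
    n = len(drift)
    spare = total_budget - n * min_budget
    if spare < 0:
        raise ValueError("total_budget is too small for min_budget per head")
    total = sum(drift)
    # Fall back to an even split when no head shows any drift.
    weights = [d / total for d in drift] if total > 0 else [1.0 / n] * n
    budgets = [min_budget + int(w * spare) for w in weights]
    # Entries lost to rounding go to the highest-drift heads first.
    leftover = max(total_budget - sum(budgets), 0)
    by_drift = sorted(range(n), key=lambda i: drift[i], reverse=True)
    for i in by_drift[:leftover]:
        budgets[i] += 1
    return budgets


if __name__ == "__main__":
    prev = [[0.7, 0.2, 0.1, 0.0], [0.7, 0.2, 0.1, 0.0]]
    curr = [[0.6, 0.3, 0.1, 0.0], [0.0, 0.1, 0.2, 0.7]]
    print(attention_drift(prev, curr, k=2))
    print(allocate_budgets([0.0, 0.5, 1.0, 0.5], total_budget=64, min_budget=8))
```

In this sketch a head whose top-k positions never change keeps only the minimum budget, while the head with the largest drift receives the largest share. A real system would also need the similarity-based grouping and hierarchical storage the paper describes, which are not modeled here.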

Code & Models

Repositories

ponytaill/HeteroCache (GitHub)