TL;DR
HeteroCache is a training-free, dynamic KV cache compression method for long-context LLM inference. It categorizes attention heads by stability and similarity and uses hierarchical storage to reduce memory and I/O overhead.
Contribution
It introduces a fine-grained, dynamic compression framework that exploits the temporal heterogeneity of attention heads and the spatial redundancy among heads in the same layer, improving efficiency without retraining.
Findings
Achieves state-of-the-art performance on long-context benchmarks.
Accelerates decoding by up to 3× at a 224K context length.
Effectively manages attention drift with hierarchical storage.
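For a sense of scale behind the 224K-context figure, the sketch below estimates the size of an uncompressed KV cache, which grows linearly with sequence length. The model shape used here (32 layers, 8 KV heads, head dimension 128, fp16) is a hypothetical example chosen for illustration and is not taken from the paper; "224K" is assumed to mean 224,000 tokens.

```python
def kv_cache_bytes(seq_len, num_layers, num_kv_heads, head_dim, bytes_per_elem=2):
    """Bytes needed to hold keys and values for one sequence of seq_len tokens."""
    # The factor of 2 covers the separate key and value tensors in every layer.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem * seq_len


# Hypothetical model shape (an assumption, not from the paper).
cfg = dict(num_layers=32, num_kv_heads=8, head_dim=128, bytes_per_elem=2)

per_token = kv_cache_bytes(1, **cfg)
total = kv_cache_bytes(224_000, **cfg)
print(f"{per_token / 1024:.0f} KiB per token, {total / 2**30:.1f} GiB at 224,000 tokens")
```

Under these assumed dimensions the cache costs 128 KiB per token, or roughly 27 GiB for a single 224,000-token sequence, which is why compression and offloading matter at this context length.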
Abstract
The linear memory growth of the KV cache poses a significant bottleneck for LLM inference in long-context tasks. Existing static compression methods often fail to preserve globally important information. Although recent dynamic retrieval approaches attempt to address this issue, they typically suffer from coarse-grained caching strategies and incur high I/O overhead. To overcome these limitations, we propose HeteroCache, a training-free dynamic compression framework. Our method is built on two key insights: attention heads exhibit diverse temporal heterogeneity, and there is significant spatial redundancy among heads within the same layer. Guided by these insights, HeteroCache categorizes heads based on stability and similarity, applying a fine-grained weighting strategy that allocates larger cache budgets to heads with rapidly shifting attention to capture context changes. Furthermore,…
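The budget-allocation idea in the abstract (larger cache budgets for heads whose attention shifts rapidly) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the top-k overlap stability score, the function names `head_stability` and `allocate_budgets`, and the proportional weighting rule are not taken from the paper, and the sketch ignores the similarity-based grouping of heads entirely.

```python
import numpy as np


def head_stability(attn_prev, attn_curr, top_k):
    """Per-head overlap of the top-k attended positions across two decoding steps.

    attn_prev, attn_curr: arrays of shape (num_heads, seq_len).
    Returns an array of shape (num_heads,) with values in [0, 1];
    1.0 means the head attends to the same top-k positions at both steps.
    """
    prev_top = np.argsort(attn_prev, axis=-1)[:, -top_k:]
    curr_top = np.argsort(attn_curr, axis=-1)[:, -top_k:]
    overlap = [len(set(p) & set(c)) / top_k for p, c in zip(prev_top, curr_top)]
    return np.array(overlap)


def allocate_budgets(stability, total_budget, min_budget=1):
    """Split total_budget cache slots across heads, favoring unstable heads.

    Hypothetical rule: each head gets min_budget slots, and the remainder is
    shared in proportion to (1 - stability). Flooring means the budgets may
    sum to slightly less than total_budget.
    """
    spare = total_budget - min_budget * len(stability)
    if spare < 0:
        raise ValueError("total_budget is too small for the requested min_budget")
    weights = (1.0 - stability) + 1e-6  # small epsilon avoids division by zero
    weights = weights / weights.sum()
    return min_budget + np.floor(weights * spare).astype(int)


# Toy example: head 0 keeps its focus, head 1 moves to the opposite end.
base = [0.40, 0.30, 0.10, 0.08, 0.05, 0.04, 0.02, 0.01]
attn_prev = np.array([base, base])
attn_curr = np.array([base, base[::-1]])

stability = head_stability(attn_prev, attn_curr, top_k=2)
budgets = allocate_budgets(stability, total_budget=100, min_budget=10)
print(stability, budgets)
```

In this toy case the stable head keeps only the minimum budget while the shifting head receives nearly all of the remaining slots. A real system would presumably measure stability over many decoding steps rather than a single pair.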
