TL;DR
IceCache introduces a memory-efficient KV-cache management strategy for long-sequence LLMs, combining semantic token clustering with PagedAttention to reduce memory usage while maintaining high accuracy.
Contribution
The paper proposes a KV cache management strategy that groups semantically related tokens into contiguous memory regions, managed by a hierarchical, dynamically updatable data structure, to improve memory efficiency in long-sequence LLM inference.
Findings
IceCache maintains 99% accuracy with only 25% of KV cache tokens.
It outperforms existing offloading methods in latency and accuracy on LongBench.
IceCache reduces memory footprint significantly while preserving model performance.
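To put the 25% figure in perspective, here is a rough sizing sketch. It relies only on the standard fact that a KV cache stores one key and one value vector per token, per layer, per KV head. The model shape (32 layers, 32 KV heads, head dimension 128, fp16 values) and the 32,768-token context are hypothetical values chosen for illustration, not the paper's setup, and treating the 25% token budget as the GPU-resident share is likewise an assumption.

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_value: int = 2,
                   batch_size: int = 1) -> int:
    """Size of a dense KV cache in bytes.

    The leading factor of 2 accounts for storing both keys and values.
    The result grows linearly with seq_len.
    """
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * bytes_per_value * batch_size)


GIB = 2 ** 30

# Hypothetical model shape and context length (assumptions, not from the paper).
LAYERS, KV_HEADS, HEAD_DIM, SEQ_LEN = 32, 32, 128, 32_768

full_cache = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, SEQ_LEN)
quarter_cache = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, SEQ_LEN // 4)

print(f"full cache:       {full_cache / GIB:.1f} GiB")     # 16.0 GiB
print(f"25% token budget: {quarter_cache / GIB:.1f} GiB")  # 4.0 GiB
```

Under these assumptions each token costs 0.5 MiB, so keeping a quarter of the tokens cuts the cache from 16 GiB to 4 GiB.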
Abstract
Key-Value (KV) cache plays a crucial role in accelerating inference in large language models (LLMs) by storing intermediate attention states and avoiding redundant computation during autoregressive generation. However, its memory footprint scales linearly with sequence length, often leading to severe memory bottlenecks on resource-constrained hardware. Prior work has explored offloading KV cache to the CPU while retaining only a subset on the GPU, but these approaches often rely on imprecise token selection and suffer performance degradation in long-generation tasks such as chain-of-thought reasoning. In this paper, we propose a novel KV cache management strategy, IceCache, which integrates semantic token clustering with PagedAttention. By organizing semantically related tokens into contiguous memory regions managed by a hierarchical, dynamically updatable data structure, our method…
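The general idea of clustering tokens and laying each cluster out in contiguous pages can be illustrated with a toy sketch. This is not the paper's algorithm: the k-means clustering, the centroid dot-product scoring, and every name below (`cluster_keys`, `build_pages`, `select_tokens`) are assumptions made for illustration, and the hierarchical, dynamically updatable structure the abstract mentions is not modeled here.

```python
from typing import Dict, List, Sequence

Vector = Sequence[float]


def sq_dist(a: Vector, b: Vector) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def centroid(vectors: List[Vector]) -> List[float]:
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def cluster_keys(keys: List[Vector], num_clusters: int, iters: int = 10) -> List[int]:
    """Toy k-means over key vectors; returns a cluster id per token."""
    # Farthest-first initialisation keeps the example deterministic.
    centers = [list(keys[0])]
    while len(centers) < num_clusters:
        far = max(keys, key=lambda k: min(sq_dist(k, c) for c in centers))
        centers.append(list(far))
    assign = [0] * len(keys)
    for _ in range(iters):
        assign = [min(range(num_clusters), key=lambda c: sq_dist(k, centers[c]))
                  for k in keys]
        for c in range(num_clusters):
            members = [k for k, a in zip(keys, assign) if a == c]
            if members:  # keep the old center if a cluster empties out
                centers[c] = centroid(members)
    return assign


def build_pages(assign: List[int], page_size: int) -> List[List[int]]:
    """Group token ids by cluster, then cut each group into fixed-size pages."""
    by_cluster: Dict[int, List[int]] = {}
    for token_id, c in enumerate(assign):
        by_cluster.setdefault(c, []).append(token_id)
    pages: List[List[int]] = []
    for c in sorted(by_cluster):
        ids = by_cluster[c]
        pages.extend(ids[i:i + page_size] for i in range(0, len(ids), page_size))
    return pages


def select_tokens(query: Vector, keys: List[Vector],
                  pages: List[List[int]], token_budget: int) -> List[int]:
    """Keep the highest-scoring pages that fit within the token budget."""
    def score(page: List[int]) -> float:
        center = centroid([keys[t] for t in page])
        return sum(q * x for q, x in zip(query, center))

    chosen: List[int] = []
    for page in sorted(pages, key=score, reverse=True):
        if len(chosen) + len(page) <= token_budget:
            chosen.extend(page)
    return sorted(chosen)


# Eight toy 2-D keys: even tokens point along x, odd tokens along y.
keys = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.1, 0.9],
        [1.0, 0.1], [0.1, 1.0], [0.8, 0.0], [0.0, 0.8]]
assign = cluster_keys(keys, num_clusters=2)
pages = build_pages(assign, page_size=2)
kept = select_tokens([1.0, 0.0], keys, pages, token_budget=4)
print(kept)  # [0, 2, 4, 6]
```

With a query along the x axis and a budget of four tokens, the sketch keeps exactly the two pages holding the x-aligned keys. Selecting whole pages rather than individual tokens is what lets related tokens be fetched as contiguous blocks.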
