IceCache: Memory-efficient KV-cache Management for Long-Sequence LLMs

Yuzhen Mao; Qitong Wang; Martin Ester; Ke Li

arXiv:2604.10539 · cs.LG · April 14, 2026

TL;DR

IceCache is a KV-cache management strategy for long-sequence LLMs that combines semantic token clustering with PagedAttention to reduce memory usage while maintaining high accuracy.

Contribution

The paper proposes a KV cache management approach that groups semantically related tokens into contiguous memory regions, managed by a hierarchical, dynamically updatable data structure, to improve memory efficiency in long-sequence LLM inference.

Findings

01. IceCache maintains 99% accuracy with only 25% of KV cache tokens.

02. It outperforms existing offloading methods in latency and accuracy on LongBench.

03. IceCache reduces memory footprint significantly while preserving model performance.
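
To make the scale of such savings concrete, the following back-of-the-envelope sketch estimates KV cache size from the standard formula (keys plus values, per layer, per KV head, per token). The model shape and the 131,072-token context are hypothetical values chosen for illustration, not figures from the paper; only the 25% token budget is taken from the findings listed here.

```python
def kv_cache_bytes(seq_len, num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Bytes needed to hold the keys and values of one sequence.

    bytes_per_value=2 assumes 16-bit storage (fp16 or bf16).
    """
    # The factor 2 covers the separate key and value tensors.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value


# Hypothetical model shape, chosen only for illustration.
config = dict(num_layers=32, num_kv_heads=8, head_dim=128)
seq_len = 131072

full = kv_cache_bytes(seq_len=seq_len, **config)
retained = kv_cache_bytes(seq_len=seq_len // 4, **config)  # 25% token budget

print(f"full cache:     {full / 2**30:.1f} GiB")      # 16.0 GiB
print(f"25% of tokens:  {retained / 2**30:.1f} GiB")  # 4.0 GiB
```

Under these assumed settings the full cache needs 16 GiB and a 25% token budget needs 4 GiB. Because the formula is linear in `seq_len`, doubling the context doubles both numbers.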

Abstract

Key-Value (KV) cache plays a crucial role in accelerating inference in large language models (LLMs) by storing intermediate attention states and avoiding redundant computation during autoregressive generation. However, its memory footprint scales linearly with sequence length, often leading to severe memory bottlenecks on resource-constrained hardware. Prior work has explored offloading KV cache to the CPU while retaining only a subset on the GPU, but these approaches often rely on imprecise token selection and suffer performance degradation in long-generation tasks such as chain-of-thought reasoning. In this paper, we propose a novel KV cache management strategy, IceCache, which integrates semantic token clustering with PagedAttention. By organizing semantically related tokens into contiguous memory regions managed by a hierarchical, dynamically updatable data structure, our method…
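
The idea of grouping semantically related tokens into pages and fetching only the relevant pages can be illustrated with a deliberately simplified sketch. This is not the paper's algorithm: it uses a flat, greedy assignment by dot-product similarity rather than the hierarchical, dynamically updatable structure the abstract describes, and it does not model PagedAttention's block tables. All names (`Page`, `build_pages`, `select_pages`, `min_similarity`) are hypothetical.

```python
from dataclasses import dataclass, field


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


@dataclass
class Page:
    """A fixed-capacity group of tokens with similar key vectors."""
    capacity: int
    token_ids: list = field(default_factory=list)
    keys: list = field(default_factory=list)

    @property
    def centroid(self):
        # Mean of the key vectors currently stored in this page.
        n = len(self.keys)
        return [sum(col) / n for col in zip(*self.keys)]

    def is_full(self):
        return len(self.keys) >= self.capacity


def build_pages(keys, page_size, min_similarity=0.0):
    """Greedily place each key in the most similar non-full page.

    A new page is opened when no open page has a centroid whose dot
    product with the key exceeds min_similarity (an assumed threshold).
    """
    pages = []
    for token_id, key in enumerate(keys):
        open_pages = [p for p in pages if not p.is_full()]
        best = max(open_pages, key=lambda p: dot(p.centroid, key), default=None)
        if best is None or dot(best.centroid, key) <= min_similarity:
            best = Page(capacity=page_size)
            pages.append(best)
        best.token_ids.append(token_id)
        best.keys.append(key)
    return pages


def select_pages(pages, query, num_pages):
    """Return the num_pages pages whose centroids best match the query."""
    ranked = sorted(pages, key=lambda p: dot(p.centroid, query), reverse=True)
    return ranked[:num_pages]


# Toy 2-D keys: tokens 0 and 2 point one way, tokens 1 and 3 another.
keys = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.1, 0.9]]
pages = build_pages(keys, page_size=2, min_similarity=0.5)
print([p.token_ids for p in pages])                      # [[0, 2], [1, 3]]
print(select_pages(pages, [1.0, 0.0], 1)[0].token_ids)   # [0, 2]
```

Scoring one centroid per page instead of every cached key is what makes page-level selection cheap in this sketch. How closely centroid scores track true attention scores depends on how tight the clusters are.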

Code & Models

Videos

IceCache: Memory-Efficient KV-cache Management for Long-Sequence LLMs · SlidesLive