Efficient Long-Context LLM Inference via KV Cache Clustering

Jie Hu; Shengnan Wang; Yutong He; Ping Gong; Jiawei Yi; Juncheng Zhang; Youhui Bai; Renhai Chen; Gong Zhang; Cheng Li; Kun Yuan

arXiv:2506.11418·cs.CL·June 16, 2025

TL;DR

Chelsea is an online KV cache clustering framework for long-context LLMs. It reduces KV cache memory usage by up to 80% and accelerates inference by up to 3.19 times without sacrificing model performance.

Contribution

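The paper proposes an online clustering method for the KV cache of long-context LLMs. Chunked Soft Matching splits the sequence into chunks and groups similar key states within each chunk; the KV cache entries in each cluster are then merged, which lowers memory use and speeds up inference.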

Findings

01. Up to 80% reduction in KV cache memory usage.

02. Inference accelerated by up to 3.19 times.

03. Latency reduced by a factor of up to 2.72.
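To give a sense of scale for the memory figure above, the following back-of-the-envelope calculation uses the standard KV cache size formula (two tensors per layer, one for keys and one for values). The model shape (32 layers, 8 KV heads, head dimension 128, 16-bit values) and the 131,072-token context are illustrative assumptions, not settings taken from the paper.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Size of the KV cache for one sequence, in bytes."""
    # Factor of 2: each layer stores one key tensor and one value tensor.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem


# Hypothetical model shape and context length, chosen only for illustration.
full = kv_cache_bytes(seq_len=131072, n_layers=32, n_kv_heads=8, head_dim=128)

# Apply the best-case 80% reduction reported in the findings.
reduced = full * (1 - 0.80)

print(f"full: {full / 2**30:.1f} GiB, after 80% reduction: {reduced / 2**30:.1f} GiB")
# prints "full: 16.0 GiB, after 80% reduction: 3.2 GiB"
```

Under these assumptions, the best-case reduction would free roughly 12.8 GiB per sequence.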

Abstract

Large language models (LLMs) with extended context windows have become increasingly prevalent for tackling complex tasks. However, the substantial Key-Value (KV) cache required for long-context LLMs poses significant deployment challenges. Existing approaches either discard potentially critical information needed for future generations or offer limited efficiency gains due to high computational overhead. In this paper, we introduce Chelsea, a simple yet effective framework for online KV cache clustering. Our approach is based on the observation that key states exhibit high similarity along the sequence dimension. To enable efficient clustering, we divide the sequence into chunks and propose Chunked Soft Matching, which employs an alternating partition strategy within each chunk and identifies clusters based on similarity. Chelsea then merges the KV cache within each cluster into a…
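The clustering idea described in the abstract can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `chunked_soft_matching`, the `chunk_size` and `threshold` parameters, the even/odd split used as the alternating partition, the use of cosine similarity, and the mean used for merging are all choices made here for the example. In particular, the abstract above is cut off before it describes how a cluster is merged.

```python
import numpy as np


def chunked_soft_matching(keys, values, chunk_size=4, threshold=0.9):
    """Cluster and merge one attention head's KV cache, chunk by chunk.

    keys, values: arrays of shape (seq_len, head_dim), seq_len >= 1.
    Returns merged (keys, values) with at most seq_len rows each.
    """
    merged_k, merged_v = [], []
    for start in range(0, len(keys), chunk_size):
        k = keys[start:start + chunk_size]
        v = values[start:start + chunk_size]

        # Alternating partition (assumed): even positions are merge
        # candidates, odd positions act as cluster anchors.
        src = np.arange(0, len(k), 2)
        dst = np.arange(1, len(k), 2)

        # Every token starts in its own cluster.
        cluster = np.arange(len(k))
        if len(dst) > 0:
            # Cosine similarity between candidate keys and anchor keys.
            kn = k / (np.linalg.norm(k, axis=1, keepdims=True) + 1e-8)
            sim = kn[src] @ kn[dst].T
            best = sim.argmax(axis=1)
            best_sim = sim[np.arange(len(src)), best]
            # A candidate joins its most similar anchor only if the
            # similarity clears the threshold.
            join = best_sim >= threshold
            cluster[src[join]] = dst[best[join]]

        # Merge each cluster into one entry (assumed: plain average).
        for c in np.unique(cluster):
            members = cluster == c
            merged_k.append(k[members].mean(axis=0))
            merged_v.append(v[members].mean(axis=0))
    return np.stack(merged_k), np.stack(merged_v)


# Tokens 0 and 1 point in the same direction; tokens 2 and 3 are orthogonal.
k = np.array([[1.0, 0.0, 0.0], [2.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
v = np.arange(12, dtype=float).reshape(4, 3)
mk, mv = chunked_soft_matching(k, v, chunk_size=4, threshold=0.9)
print(mk.shape)  # (3, 3): four cached tokens were reduced to three
```

Restricting the similarity search to one chunk at a time keeps the cost of matching proportional to the chunk size rather than to the full sequence length, which is what makes clustering feasible online in this sketch.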

Taxonomy

Topics: Network Packet Processing and Optimization · Algorithms and Data Compression · Caching and Content Delivery