Efficient Long-Context LLM Inference via KV Cache Clustering
Jie Hu, Shengnan Wang, Yutong He, Ping Gong, Jiawei Yi, Juncheng Zhang, Youhui Bai, Renhai Chen, Gong Zhang, Cheng Li, Kun Yuan

TL;DR
Chelsea introduces an online KV cache clustering framework for long-context LLMs that significantly reduces memory usage and accelerates inference without sacrificing model performance.
Contribution
The paper proposes a novel clustering method for KV caches in LLMs, enabling efficient memory use and faster inference through Chunked Soft Matching and cache merging.
Findings
Up to 80% reduction in KV cache memory usage.
Inference acceleration by up to 3.19 times.
Latency reduction of up to 2.72 times.
Abstract
Large language models (LLMs) with extended context windows have become increasingly prevalent for tackling complex tasks. However, the substantial Key-Value (KV) cache required for long-context LLMs poses significant deployment challenges. Existing approaches either discard potentially critical information needed for future generations or offer limited efficiency gains due to high computational overhead. In this paper, we introduce Chelsea, a simple yet effective framework for online KV cache clustering. Our approach is based on the observation that key states exhibit high similarity along the sequence dimension. To enable efficient clustering, we divide the sequence into chunks and propose Chunked Soft Matching, which employs an alternating partition strategy within each chunk and identifies clusters based on similarity. Chelsea then merges the KV cache within each cluster into a…
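The clustering step described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation. The function name `chunked_soft_matching`, the even/odd split used as the "alternating partition", the cosine-similarity threshold, and the use of a plain mean to merge each cluster are all assumptions made for illustration. The abstract is cut off before it states what each cluster is merged into.

```python
import numpy as np

def chunked_soft_matching(keys, values, chunk_size=4, threshold=0.9):
    """Hypothetical sketch of chunk-wise KV clustering (not the paper's code).

    keys, values: arrays of shape (seq_len, head_dim).
    Returns merged keys and values with one row per cluster.
    """
    n = keys.shape[0]
    # Unit-normalize keys so that dot products are cosine similarities.
    unit = keys / (np.linalg.norm(keys, axis=1, keepdims=True) + 1e-8)
    merged_k, merged_v = [], []
    for start in range(0, n, chunk_size):
        idx = np.arange(start, min(start + chunk_size, n))
        # Assumed alternating partition: even positions (a) vs odd positions (b).
        a, b = idx[0::2], idx[1::2]
        # Every token in b seeds its own cluster.
        clusters = {int(j): [int(j)] for j in b}
        singles = []
        if len(b) > 0:
            sim = unit[a] @ unit[b].T      # similarity of each a-token to each b-token
            best = sim.argmax(axis=1)      # most similar b-token for each a-token
            for row, i in enumerate(a):
                if sim[row, best[row]] >= threshold:
                    clusters[int(b[best[row]])].append(int(i))
                else:
                    singles.append([int(i)])  # too dissimilar: keep unmerged
        else:
            singles = [[int(i)] for i in a]
        # Keep clusters in sequence order, then merge each one by averaging (assumed).
        groups = sorted(list(clusters.values()) + singles, key=min)
        for members in groups:
            merged_k.append(keys[members].mean(axis=0))
            merged_v.append(values[members].mean(axis=0))
    return np.stack(merged_k), np.stack(merged_v)

# Four tokens forming two near-duplicate pairs collapse to two cache entries.
demo_keys = np.array([[1.0, 0.0], [1.0, 0.01], [0.0, 1.0], [0.01, 1.0]])
demo_values = np.arange(8, dtype=float).reshape(4, 2)
small_k, small_v = chunked_soft_matching(demo_keys, demo_values)
print(small_k.shape)
```

Restricting the similarity search to one chunk at a time keeps the matching cost proportional to the chunk size rather than to the full sequence length, which is the efficiency argument the abstract makes for chunking.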
Taxonomy
Topics: Network Packet Processing and Optimization · Algorithms and Data Compression · Caching and Content Delivery
