RDKV: Rate-Distortion Bit Allocation for Joint Eviction and Quantization of the KV Cache
Junkai Zhang, Hang Guo, Luca Benini, Yawei Li

TL;DR
This paper introduces RDKV, a joint rate-distortion approach for KV cache compression in large language models, optimizing eviction and quantization simultaneously to improve inference efficiency.
Contribution
The paper formulates KV cache compression as a rate-distortion problem and proposes RDKV, which the authors present as the first method to jointly optimize eviction and quantization for LLM inference.
Findings
RDKV outperforms baselines by 9.1% on average.
RDKV recovers 97.81% of full-cache accuracy with only 2.48% cache retention.
RDKV achieves a 4.5x speedup and a 1.9x memory reduction at 128K context length.
Abstract
Large language models (LLMs) have shown strong performance across diverse tasks, but their inference with long input contexts is bottlenecked by memory size and bandwidth. The Key-Value (KV) cache size grows linearly with sequence length and needs to be re-read from off-chip high-bandwidth memory (HBM) to on-chip memory at every decoding step, resulting in memory-bound inference. Existing methods reduce the cache by either eviction or quantization, but typically treat the two in isolation. In this paper, we cast KV cache compression as a rate-distortion problem, under which eviction and quantization are two end-points of the same bit allocation scheme. This exposes the need to optimize them jointly, motivating our method, RDKV (Rate-Distortion KV cache compression). RDKV derives the weight of each token or channel from the distortion that compression induces on the attention…
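To make the "two end-points of the same bit allocation scheme" idea concrete, here is a minimal sketch of bit allocation under a fixed budget. It is not the RDKV algorithm: the distortion model (an item with weight w stored at b bits contributes w * 4^(-b) distortion), the greedy solver, and the names `allocate_bits` and `total_distortion` are all assumptions chosen for illustration. Under this model, giving an item 0 bits is eviction, and giving it more bits is finer quantization.

```python
import heapq


def allocate_bits(weights, budget_bits, max_bits=8):
    """Greedy bit allocation over cache items (tokens or channels).

    Assumed model (illustrative, not taken from the paper): an item with
    weight w stored at b bits contributes w * 4**(-b) distortion, so
    b = 0 (eviction) costs the full weight w.
    """
    bits = [0] * len(weights)
    # Max-heap (via negated keys) on the distortion removed by one more bit.
    # Going from b to b + 1 bits removes w * 4**(-b) * 0.75.
    heap = [(-(w * 0.75), i) for i, w in enumerate(weights)]
    heapq.heapify(heap)
    remaining = budget_bits
    while remaining > 0 and heap:
        _, i = heapq.heappop(heap)
        bits[i] += 1
        remaining -= 1
        if bits[i] < max_bits:
            next_gain = weights[i] * (4.0 ** -bits[i]) * 0.75
            heapq.heappush(heap, (-next_gain, i))
    return bits


def total_distortion(weights, bits):
    """Total distortion under the same assumed model."""
    return sum(w * (4.0 ** -b) for w, b in zip(weights, bits))


if __name__ == "__main__":
    weights = [8.0, 1.0, 0.01, 3.0]   # hypothetical per-item weights
    bits = allocate_bits(weights, budget_bits=6)
    print(bits, total_distortion(weights, bits))
```

Because the per-bit gain shrinks as an item receives more bits, the greedy choice is optimal for this particular model. With the hypothetical weights above, the lowest-weight item ends up with 0 bits (evicted) while the others are quantized at different precisions, so eviction and quantization come out of one allocation rather than two separate decisions.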
