RocketKV: Accelerating Long-Context LLM Inference via Two-Stage KV Cache Compression
Payman Behnam, Yaosheng Fu, Ritchie Zhao, Po-An Tsai, Zhiding Yu, Alexey Tumanov

TL;DR
RocketKV introduces a two-stage, training-free KV cache compression method for long-context LLM inference, reporting up to 3.7× end-to-end speedup and up to 32.6% peak memory reduction in the decode phase with minimal accuracy loss.
Contribution
The paper proposes a two-stage KV cache compression strategy, coarse-grain permanent eviction followed by fine-grain top-k sparse attention, that reduces memory and computation during long-context LLM decoding without retraining.
Findings
Up to 400× compression ratio achieved.
End-to-end speedup of up to 3.7× on NVIDIA A100.
Peak memory reduction of up to 32.6% in the decode phase.
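To give a sense of scale for the compression ratio above, the following is a minimal arithmetic sketch, not taken from the paper. It assumes the compression ratio is the number of cached tokens divided by the number of tokens retained, and it uses a hypothetical model configuration (32 layers, 8 KV heads, head dimension 128, 2-byte elements) chosen only for illustration. The function names are likewise hypothetical.

```python
def kv_cache_bytes(num_tokens, num_layers, num_kv_heads, head_dim, bytes_per_elem=2):
    """Size of an uncompressed KV cache for one sequence.

    The leading factor of 2 accounts for storing both keys and values.
    """
    return 2 * num_tokens * num_layers * num_kv_heads * head_dim * bytes_per_elem


def retained_tokens(num_tokens, compression_ratio):
    """Tokens kept under a given compression ratio (assumed: tokens / ratio)."""
    return max(1, num_tokens // compression_ratio)


# Hypothetical configuration, NOT from the paper.
LAYERS, KV_HEADS, HEAD_DIM = 32, 8, 128

full = kv_cache_bytes(128_000, LAYERS, KV_HEADS, HEAD_DIM)
kept = retained_tokens(128_000, 400)
small = kv_cache_bytes(kept, LAYERS, KV_HEADS, HEAD_DIM)

print(f"full cache:       {full / 2**30:.3f} GiB")
print(f"retained tokens:  {kept}")
print(f"compressed cache: {small / 2**20:.1f} MiB")
```

Under these assumptions, the cache grows linearly with the number of tokens, so a 400× reduction in retained tokens shrinks the cache by the same factor.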
Abstract
Transformer-based Large Language Models rely critically on the KV cache to efficiently handle extended contexts during the decode phase. Yet, the size of the KV cache grows proportionally with the input length, burdening both memory bandwidth and capacity as decoding progresses. To address this challenge, we present RocketKV, a training-free KV cache compression strategy containing two consecutive stages. In the first stage, it performs coarse-grain permanent KV cache eviction on the input sequence tokens. In the second stage, it adopts a hybrid sparse attention method to conduct fine-grain top-k sparse attention, approximating the attention scores by leveraging both head and sequence dimensionality reductions. We show that RocketKV provides a compression ratio of up to 400×, end-to-end speedup of up to 3.7×, as well as peak memory reduction of up to 32.6% in the decode…
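The two stages described in the abstract can be sketched for a single attention head as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the per-token `importance` scores in stage one are taken as given (the abstract does not say how they are computed), and stage two selects the top-k tokens using exact dot-product scores rather than the approximation over head and sequence dimensions that the abstract mentions. All function and parameter names are hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def stage1_evict(keys, values, importance, keep):
    """Coarse-grain permanent eviction: keep the `keep` highest-scoring tokens.

    `importance` is an assumed, externally supplied per-token score.
    Evicted tokens are dropped for good; survivors keep their original order.
    """
    ranked = sorted(range(len(keys)), key=lambda i: importance[i], reverse=True)
    kept = sorted(ranked[:keep])
    return [keys[i] for i in kept], [values[i] for i in kept]


def stage2_topk_attention(query, keys, values, k):
    """Fine-grain top-k sparse attention over the surviving tokens.

    Assumption: exact scaled dot-product scores are used to pick the top-k
    tokens; softmax attention is then computed over only those k tokens.
    """
    scale = math.sqrt(len(query))
    scores = [dot(query, key) / scale for key in keys]
    top = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)[:k]
    peak = max(scores[i] for i in top)  # subtract the max for numerical stability
    weights = [math.exp(scores[i] - peak) for i in top]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * values[i][d] for w, i in zip(weights, top)) / total
            for d in range(dim)]


# Toy data: four cached tokens with 2-dimensional keys and values.
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
values = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [9.0, 9.0]]
importance = [0.9, 0.1, 0.8, 0.2]

kept_keys, kept_values = stage1_evict(keys, values, importance, keep=2)
output = stage2_topk_attention([0.0, 1.0], kept_keys, kept_values, k=1)
print(kept_keys, output)
```

Stage one runs once on the input sequence and shrinks the cache permanently, while stage two runs at every decode step and only reads the k selected entries.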
Taxonomy
Topics: Network Packet Processing and Optimization · Algorithms and Data Compression · Advanced Data Storage Technologies
Methods: Softmax · Attention Is All You Need
