RocketKV: Accelerating Long-Context LLM Inference via Two-Stage KV Cache Compression

Payman Behnam; Yaosheng Fu; Ritchie Zhao; Po-An Tsai; Zhiding Yu; Alexey Tumanov

arXiv:2502.14051·cs.CL·August 14, 2025

TL;DR

RocketKV introduces a two-stage, training-free KV cache compression method for long-context LLM inference, achieving significant speedups and memory savings with minimal accuracy loss.

Contribution

It proposes a novel two-stage KV cache compression strategy that significantly reduces memory and computation during long-context LLM decoding without retraining.

Findings

01

Up to 400× compression ratio achieved.

02

End-to-end speedup of up to 3.7× on NVIDIA A100.

03

Peak memory reduction of up to 32.6% in the decode phase.

Abstract

Transformer-based Large Language Models rely critically on the KV cache to efficiently handle extended contexts during the decode phase. Yet, the size of the KV cache grows proportionally with the input length, burdening both memory bandwidth and capacity as decoding progresses. To address this challenge, we present RocketKV, a training-free KV cache compression strategy containing two consecutive stages. In the first stage, it performs coarse-grain permanent KV cache eviction on the input sequence tokens. In the second stage, it adopts a hybrid sparse attention method to conduct fine-grain top-k sparse attention, approximating the attention scores by leveraging both head and sequence dimensionality reductions. We show that RocketKV provides a compression ratio of up to 400$\times$, end-to-end speedup of up to 3.7$\times$ as well as peak memory reduction of up to 32.6% in the decode…
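The two stages described in the abstract can be illustrated with a small single-head NumPy sketch. This is not the paper's implementation: the function names (`stage1_evict`, `stage2_attend`), the observation-query voting used for eviction, the per-page min/max key summaries, and the selection of the largest-magnitude query dimensions are all assumptions chosen only to show what "permanent eviction followed by approximate top-k sparse attention" might look like.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def stage1_evict(keys, values, obs_queries, keep):
    """Stage 1 (assumed heuristic): permanently keep the `keep` prompt tokens
    that receive the most attention from a few observation queries."""
    d = keys.shape[-1]
    votes = softmax(obs_queries @ keys.T / np.sqrt(d)).sum(axis=0)
    kept = np.sort(np.argsort(votes)[-keep:])  # preserve original token order
    return keys[kept], values[kept], kept

def stage2_attend(q, keys, values, top_k, n_dims, page):
    """Stage 2 (assumed heuristic): approximate scores, then exact attention
    over roughly `top_k` selected tokens."""
    n, d = keys.shape
    # Head-dimension reduction: use only the largest-magnitude query dims.
    dims = np.argsort(np.abs(q))[-n_dims:]
    # Sequence-dimension reduction: one min/max summary per page of keys,
    # giving an upper bound on q.k for every token in that page.
    starts = np.arange(0, n, page)
    bounds = np.array([
        np.maximum(q[dims] * keys[s:s + page, dims].max(axis=0),
                   q[dims] * keys[s:s + page, dims].min(axis=0)).sum()
        for s in starts])
    n_pages = min(len(starts), max(1, -(-top_k // page)))  # ceil(top_k / page)
    chosen = np.sort(starts[np.argsort(bounds)[-n_pages:]])
    idx = np.concatenate([np.arange(s, min(s + page, n)) for s in chosen])
    # Exact softmax attention restricted to the selected tokens.
    w = softmax(q @ keys[idx].T / np.sqrt(d))
    return w @ values[idx], idx

# Usage on random data (shapes are illustrative only).
rng = np.random.default_rng(0)
n, d = 512, 64
K, V = rng.standard_normal((n, d)), rng.standard_normal((n, d))
obs_q = rng.standard_normal((8, d))
K1, V1, kept = stage1_evict(K, V, obs_q, keep=128)       # 512 -> 128 tokens
q = rng.standard_normal(d)
out, idx = stage2_attend(q, K1, V1, top_k=32, n_dims=16, page=8)
```

In this sketch the first stage shrinks what is stored, while the second stage shrinks what is read at each decode step; the tokens dropped in stage 1 are gone for good, whereas stage 2 can pick a different subset for every query.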

Videos

RocketKV: Accelerating Long-Context LLM Inference via Two-Stage KV Cache Compression · slideslive

Taxonomy

Topics: Network Packet Processing and Optimization · Algorithms and Data Compression · Advanced Data Storage Technologies

Methods: Softmax · Attention Is All You Need