KVPR: Efficient LLM Inference with I/O-Aware KV Cache Partial Recomputation

Chaoyi Jiang; Lei Gao; Hossein Entezari Zarch; Murali Annavaram

arXiv:2411.17089·cs.LG·June 5, 2025

TL;DR

KVPR overlaps partial KV cache recomputation with data transfer to make LLM inference more efficient, reducing decoding latency by up to 35.8% and increasing throughput by up to 46.2%.

Contribution

It introduces an I/O-aware inference technique that overlaps KV cache transfer over PCIe with partial recomputation of the cache on the GPU, outperforming existing offloading methods in decoding latency and throughput.
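
The overlap idea can be illustrated with a toy cost model. This is a minimal sketch under stated assumptions, not the authors' scheduler: it assumes the GPU first receives the activations for the first `split` tokens, then recomputes their KV entries while the KV entries of the remaining tokens are still in flight over PCIe. The function names and every number in `cost` are hypothetical.

```python
def step_time(split, seq_len, kv_bytes_per_token, act_bytes_per_token,
              pcie_bytes_per_s, recompute_s_per_token):
    """Estimated time until the full KV cache is available on the GPU.

    Assumed (simplified) schedule:
      1. transfer activations for tokens [0, split)
      2. recompute KV for those tokens on the GPU, in parallel with
         transferring the stored KV for tokens [split, seq_len)
    """
    act_transfer = split * act_bytes_per_token / pcie_bytes_per_s
    recompute = split * recompute_s_per_token
    kv_transfer = (seq_len - split) * kv_bytes_per_token / pcie_bytes_per_s
    # Steps 2a and 2b overlap, so only the slower of the two counts.
    return act_transfer + max(recompute, kv_transfer)


def best_split(seq_len, **cost):
    """Brute-force search for the split point with the lowest estimated time."""
    return min(range(seq_len + 1), key=lambda l: step_time(l, seq_len, **cost))


# Hypothetical per-layer costs: hidden size 4096 in fp16, so each token has
# 8 KiB of activations and 16 KiB of K plus V. Bandwidth and recompute cost
# are made-up illustrative values, not measurements.
cost = dict(
    kv_bytes_per_token=16384,
    act_bytes_per_token=8192,
    pcie_bytes_per_s=16e9,
    recompute_s_per_token=2e-7,
)
seq_len = 1024
split = best_split(seq_len, **cost)
transfer_only = step_time(0, seq_len, **cost)   # no recomputation at all
overlapped = step_time(split, seq_len, **cost)
print(split, transfer_only, overlapped)
```

In this toy model the estimated time falls as `split` grows while the PCIe transfer is the slower side, and rises once recomputation becomes the slower side, so the best split sits near the point where the two overlapped costs balance.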

Findings

1. Up to 35.8% lower latency during decoding.

2. Up to 46.2% higher throughput than state-of-the-art approaches.

3. The method is automated through profiling, scheduling, and runtime modules.

Abstract

Inference for Large Language Models (LLMs) is computationally demanding. To reduce the cost of auto-regressive decoding, a Key-Value (KV) cache stores intermediate activations, which significantly lowers the computational overhead of token generation. However, the memory required for the KV cache grows rapidly, often exceeding GPU memory capacity. A cost-effective alternative is to offload the KV cache to CPU memory, which alleviates GPU memory pressure but shifts the bottleneck to the limited bandwidth of the PCIe connection between the CPU and GPU. Existing methods attempt to address these issues by overlapping GPU computation with I/O or by employing CPU-GPU heterogeneous execution, but they are hindered by excessive data movement and dependence on CPU capabilities. Fully overlapping PCIe communication latency becomes challenging as the size of the KV cache grows and/or the…
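
To make the memory and bandwidth pressure described in the abstract concrete, here is a back-of-the-envelope sketch. The sizing formula is the standard one for a transformer KV cache (one K and one V tensor per layer); the model shape, batch size, and PCIe bandwidth below are hypothetical values chosen for illustration, not figures from the paper.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, batch_size,
                   bytes_per_elem=2):
    """Total KV cache size in bytes (bytes_per_elem=2 corresponds to fp16)."""
    # Leading factor 2: each layer stores both a K tensor and a V tensor.
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)


def transfer_seconds(num_bytes, bandwidth_bytes_per_s):
    """Idealized time to move num_bytes over a link, ignoring latency."""
    return num_bytes / bandwidth_bytes_per_s


# Hypothetical configuration: 32 layers, 32 KV heads of dimension 128,
# 2048 tokens of context, batch size 32, fp16 values.
size = kv_cache_bytes(num_layers=32, num_kv_heads=32, head_dim=128,
                      seq_len=2048, batch_size=32)
gib = size / 2**30  # 32 GiB for these settings

# Hypothetical effective PCIe bandwidth of 16 GB/s.
seconds = transfer_seconds(size, 16e9)
print(f"{gib:.0f} GiB KV cache, about {seconds:.2f} s to move over PCIe")
```

The size grows linearly in both sequence length and batch size, which is why a cache that fits on the GPU for short prompts can exceed GPU memory for long contexts or large batches, and why moving it across PCIe each decoding step becomes the bottleneck.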

Code & Models

Repositories

chaoyij/kvpr (jax, Official)

Taxonomy

Topics: Advanced Data Storage Technologies · Algorithms and Data Compression · Caching and Content Delivery

Methods: Sparse Evolutionary Training