KVPR: Efficient LLM Inference with I/O-Aware KV Cache Partial Recomputation
Chaoyi Jiang, Lei Gao, Hossein Entezari Zarch, Murali Annavaram

TL;DR
KVPR overlaps partial KV cache recomputation with KV cache transfer to make LLM inference more efficient, reducing decoding latency by up to 35.8% and increasing throughput by up to 46.2%.
Contribution
The paper introduces an I/O-aware inference technique that overlaps KV cache transfer with partial KV cache recomputation, outperforming state-of-the-art methods in both decoding latency and throughput.
Findings
Up to 35.8% lower latency during decoding.
Up to 46.2% higher throughput than state-of-the-art approaches.
The method is automated through profiling, scheduling, and runtime modules.
Abstract
Inference for Large Language Models (LLMs) is computationally demanding. To reduce the cost of auto-regressive decoding, Key-Value (KV) cache is used to store intermediate activations, which significantly lowers the computational overhead for token generation. However, the memory required for the KV cache grows rapidly, often exceeding the capacity of GPU memory. A cost-effective alternative is to offload KV cache to CPU memory, which alleviates GPU memory pressure, but shifts the bottleneck to the limited bandwidth of the PCIe connection between the CPU and GPU. Existing methods attempt to address these issues by overlapping GPU computation with I/O or employing CPU-GPU heterogeneous execution, but they are hindered by excessive data movement and dependence on CPU capabilities. Fully overlapping PCIe communication latency gets challenging as the size of the KV cache grows and/or the…
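The trade-off described above, PCIe transfer time versus GPU recomputation time, can be illustrated with a toy cost model. The sketch below is not the authors' formulation: the abstract is truncated here, so the split-point scheme, the `Profile` fields, and every helper name are hypothetical. It assumes that the KV entries of the first `split` tokens of a layer are recomputed on the GPU from their transferred input activations (half the size of the corresponding K and V tensors), while the KV entries of the remaining tokens are copied over PCIe concurrently.

```python
from dataclasses import dataclass


@dataclass
class Profile:
    """Hypothetical per-layer measurements; every field is an illustrative assumption."""
    pcie_bytes_per_s: float   # effective CPU-to-GPU bandwidth
    gpu_flops_per_s: float    # achieved GPU throughput on the K/V projections
    hidden_size: int          # model hidden dimension h
    bytes_per_elem: int = 2   # fp16 storage


def kv_bytes(tokens: int, p: Profile, batch: int = 1) -> int:
    # K and V each store hidden_size elements per token for one layer.
    return 2 * tokens * p.hidden_size * p.bytes_per_elem * batch


def activation_bytes(tokens: int, p: Profile, batch: int = 1) -> int:
    # Input activations store hidden_size elements per token, half the KV size.
    return tokens * p.hidden_size * p.bytes_per_elem * batch


def recompute_seconds(tokens: int, p: Profile, batch: int = 1) -> float:
    # Two projections (K and V), each costing about 2 * h * h FLOPs per token.
    flops = 2 * 2 * p.hidden_size ** 2 * tokens * batch
    return flops / p.gpu_flops_per_s


def step_seconds(split: int, seq_len: int, p: Profile, batch: int = 1) -> float:
    """Estimated per-layer time when the first `split` tokens are recomputed."""
    t_act = activation_bytes(split, p, batch) / p.pcie_bytes_per_s
    t_kv = kv_bytes(seq_len - split, p, batch) / p.pcie_bytes_per_s
    t_comp = recompute_seconds(split, p, batch)
    # Activations must arrive first; recomputation then overlaps with the
    # transfer of the remaining KV cache, so the slower of the two dominates.
    return t_act + max(t_comp, t_kv)


def best_split(seq_len: int, p: Profile, batch: int = 1) -> int:
    # Exhaustive search is cheap because the split is a single integer.
    return min(range(seq_len + 1), key=lambda s: step_seconds(s, seq_len, p, batch))


if __name__ == "__main__":
    # Illustrative numbers only, not measurements from the paper.
    profile = Profile(pcie_bytes_per_s=16e9, gpu_flops_per_s=50e12, hidden_size=4096)
    seq_len = 1024
    split = best_split(seq_len, profile)
    print("tokens to recompute:", split)
    print("transfer-only time :", step_seconds(0, seq_len, profile))
    print("overlapped time    :", step_seconds(split, seq_len, profile))
```

Under this model the estimated time falls as more tokens are recomputed, up to the point where recomputation takes as long as the remaining transfer, and rises afterwards. A faster GPU or a slower link pushes that balance point toward recomputing more tokens.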
Taxonomy
Topics: Advanced Data Storage Technologies · Algorithms and Data Compression · Caching and Content Delivery
Methods: Sparse Evolutionary Training
