Predictive Multi-Tier Memory Management for KV Cache in Large-Scale GPU Inference
Sanjeev Rao Ganjihal

TL;DR
This paper introduces a unified, multi-tier memory management system for KV caches in large-scale GPU inference, significantly improving efficiency, capacity, and cost-effectiveness.
Contribution
It presents a novel architecture-variant-aware sizing engine, a multi-tier hierarchy, and a Bayesian reuse predictor to optimize KV cache management in GPU inference.
Findings
Achieves up to 7.4x higher batch sizes.
Extends cache capacity from 40 GB to over 38 TB per node.
Projects 1.4-2.1x reduction in time-to-first-token and 47% cost savings.
Abstract
Key-value (KV) cache memory management is the primary bottleneck limiting throughput and cost-efficiency in large-scale GPU inference serving. Current systems suffer from three compounding inefficiencies: (1) the absence of unified KV cache sizing across all attention architectures, particularly multi-head latent attention (MLA), which is unsupported in general-purpose frameworks, resulting in up to 57x memory over-provisioning; (2) confinement of the KV cache to a single memory tier (GPU HBM) despite the availability of a rich hierarchy spanning CPU DRAM, CXL-attached memory, NVMe via GPUDirect Storage, RDMA fabric, and parallel filesystems; and (3) reactive eviction policies that discard reusable state, forcing redundant recomputation. We present a unified system that addresses all three problems. Our architecture-variant-aware sizing engine computes exact memory requirements per…
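To make the sizing argument in the abstract concrete, the following is a minimal sketch of per-token KV cache sizing across attention variants. It is not the paper's sizing engine, whose details are not given here. The class `AttentionConfig`, the function names, and all dimensions (60 layers, 128 heads of width 128, a 512-wide latent plus a 64-wide RoPE key) are illustrative assumptions chosen to resemble a publicly described MLA configuration. Under those assumptions, sizing an MLA model as if it used standard multi-head attention over-provisions by roughly the factor the abstract cites.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AttentionConfig:
    """Hypothetical description of a model's attention layout (illustrative only)."""
    variant: str           # "mha", "gqa", "mqa", or "mla"
    num_layers: int
    num_heads: int
    head_dim: int
    num_kv_heads: int = 0  # used only by "gqa"
    latent_dim: int = 0    # used only by "mla": width of the compressed KV latent
    rope_dim: int = 0      # used only by "mla": width of the decoupled RoPE key
    dtype_bytes: int = 2   # 2 bytes per element for FP16/BF16


def kv_elements_per_token_per_layer(cfg: AttentionConfig) -> int:
    """Number of cached elements one token occupies in one layer."""
    if cfg.variant == "mha":
        # One key and one value vector per query head.
        return 2 * cfg.num_heads * cfg.head_dim
    if cfg.variant == "gqa":
        # Keys and values are shared within each group of query heads.
        return 2 * cfg.num_kv_heads * cfg.head_dim
    if cfg.variant == "mqa":
        # A single key/value head is shared by all query heads.
        return 2 * cfg.head_dim
    if cfg.variant == "mla":
        # Only the compressed latent and the shared RoPE key are cached.
        return cfg.latent_dim + cfg.rope_dim
    raise ValueError(f"unknown attention variant: {cfg.variant!r}")


def kv_bytes_per_token(cfg: AttentionConfig) -> int:
    """Bytes of KV cache one token occupies across all layers."""
    return kv_elements_per_token_per_layer(cfg) * cfg.num_layers * cfg.dtype_bytes


# Assumed dimensions, for illustration only.
mla = AttentionConfig("mla", num_layers=60, num_heads=128, head_dim=128,
                      latent_dim=512, rope_dim=64)
# The same model sized as if it cached full per-head keys and values.
as_mha = AttentionConfig("mha", num_layers=60, num_heads=128, head_dim=128)

over_provisioning = kv_bytes_per_token(as_mha) / kv_bytes_per_token(mla)
print(f"MLA: {kv_bytes_per_token(mla)} bytes/token")
print(f"Sized as MHA: {kv_bytes_per_token(as_mha)} bytes/token")
print(f"Over-provisioning factor: {over_provisioning:.1f}x")
```

Multiplying the per-token figure by the sequence length and batch size gives the cache footprint for a workload, which is the quantity a scheduler would compare against each tier's capacity.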
