Understanding Bottlenecks for Efficiently Serving LLM Inference With KV Offloading
William Meng (1, 2), Benjamin Lee (1), Hong Wang (2) ((1) University of Pennsylvania, (2) Intel)

TL;DR
This paper analyzes the bottlenecks in serving large language model inference with KV cache offloading, revealing that PCIe bandwidth limits cause significant delays and proposing optimizations for hardware and scheduling.
Contribution
It develops an analytical framework to identify memory-bound thresholds and proposes optimizations for hardware interconnects, model architectures, and scheduling.
Findings
99% of latency is spent on data transfers
GPUs consume only 28% of their rated TDP while serving offloaded requests
Typical workloads exceed the critical cached-to-prefill token ratio by orders of magnitude
Abstract
KV cache offloading enables long-context LLM inference by storing caches in CPU DRAM, but PCIe bandwidth limitations create severe bottlenecks. In this paper, we develop an analytical framework that derives the critical cached-to-prefill token ratio at which execution becomes memory-bound, and we show that typical workloads exceed this threshold by orders of magnitude. Empirical characterization reveals that 99% of latency is spent on transfers, and that serving offloaded requests results in GPUs consuming only 28% of their rated TDP. These results motivate our proposed optimizations for hardware interconnects, model architectures, and scheduling algorithms.
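To make the idea of a critical cached-to-prefill token ratio concrete, the sketch below estimates a break-even point with a deliberately simplified model. It is not the paper's derivation. It assumes that loading one cached token costs its KV bytes divided by the host-to-GPU bandwidth, that computing one prefill token costs roughly 2 FLOPs per model parameter (attention FLOPs are ignored), and that transfers and compute overlap perfectly. The `Setup` class, its fields, and every number in the usage note are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Setup:
    """Hypothetical model and hardware parameters (illustrative only)."""
    n_layers: int            # transformer layers
    n_kv_heads: int          # KV heads per layer (after any grouping)
    head_dim: int            # dimension per head
    bytes_per_elem: int      # e.g. 2 for fp16/bf16
    n_params: float          # total model parameters
    peak_flops: float        # achievable GPU FLOP/s (assumed, not rated)
    pcie_bytes_per_s: float  # achievable host-to-GPU bandwidth in bytes/s


def kv_bytes_per_token(s: Setup) -> int:
    # One key and one value vector per KV head per layer.
    return 2 * s.n_layers * s.n_kv_heads * s.head_dim * s.bytes_per_elem


def critical_ratio(s: Setup) -> float:
    """Cached-to-prefill ratio at which transfer time equals compute time."""
    # Assumption: ~2 FLOPs per parameter per prefill token.
    compute_s_per_prefill_token = 2.0 * s.n_params / s.peak_flops
    transfer_s_per_cached_token = kv_bytes_per_token(s) / s.pcie_bytes_per_s
    return compute_s_per_prefill_token / transfer_s_per_cached_token


def is_transfer_bound(s: Setup, cached_tokens: int, prefill_tokens: int) -> bool:
    # Above the critical ratio, KV transfers take longer than prefill compute.
    return cached_tokens > critical_ratio(s) * prefill_tokens
```

As a usage example, `Setup(32, 8, 128, 2, 8e9, 1e15, 64e9)` describes a made-up 8B-parameter model on a made-up accelerator. Under this toy model, a request that reuses 10,000 cached tokens and adds 100 new prefill tokens is classified as transfer-bound by `is_transfer_bound`, which matches the qualitative claim that long cached contexts with short new prompts push execution past the threshold.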
Taxonomy
Topics: Parallel Computing and Optimization Techniques · Cloud Computing and Resource Management · Advanced Data Storage Technologies
