Understanding Bottlenecks for Efficiently Serving LLM Inference With KV Offloading

William Meng (1; 2); Benjamin Lee (1); Hong Wang (2) ((1) University of Pennsylvania; (2) Intel)

arXiv:2601.19910 · cs.AR · January 29, 2026

TL;DR

This paper analyzes the bottlenecks in serving large language model inference with KV cache offloading, revealing that PCIe bandwidth limits cause significant delays and proposing optimizations for hardware and scheduling.

Contribution

It develops an analytical framework to identify memory-bound thresholds and proposes optimizations for hardware interconnects, model architectures, and scheduling.

Findings

01 99% of latency is spent on data transfers

02 GPUs consume only 28% of their rated TDP when serving offloaded requests

03 Typical workloads exceed the critical cached-to-prefill token ratio by orders of magnitude

Abstract

KV cache offloading enables long-context LLM inference by storing caches in CPU DRAM, but PCIe bandwidth limitations create severe bottlenecks. In this paper, we develop an analytical framework that derives $\kappa_{crit}$, the critical cached-to-prefill token ratio at which execution becomes memory-bound, and show that typical workloads exceed this threshold by orders of magnitude. Empirical characterization reveals that 99% of latency is spent on transfers and that serving offloaded requests results in GPUs consuming only 28% of their rated TDP, motivating our proposed optimizations for hardware interconnects, model architectures, and scheduling algorithms.
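To make the idea of a critical cached-to-prefill ratio concrete, here is a minimal back-of-the-envelope sketch. It is not the paper's derivation: it assumes a simple roofline-style model in which each newly prefilled token costs about 2 FLOPs per model parameter, each cached token requires moving its K and V tensors over PCIe, and the two phases are compared by time alone (attention FLOPs over the cached tokens are ignored). The function names and all numeric values are hypothetical and chosen only for illustration.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_elem=2):
    """Bytes of KV cache stored per token: one K and one V vector per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem


def critical_ratio(num_params, peak_flops, pcie_bytes_per_s, kv_bytes):
    """Cached-to-prefill token ratio at which transfer time equals compute time.

    Assumed model (illustrative, not taken from the paper):
      compute time per prefill token  ~ 2 * num_params / peak_flops
      transfer time per cached token  = kv_bytes / pcie_bytes_per_s
    Above the returned ratio, PCIe transfers dominate under this model.
    """
    t_compute = 2 * num_params / peak_flops
    t_transfer = kv_bytes / pcie_bytes_per_s
    return t_compute / t_transfer


if __name__ == "__main__":
    # Hypothetical configuration: 8B-parameter model, 32 layers, 8 KV heads,
    # head dimension 128, fp16 cache, 1 PFLOP/s peak compute, 64 GB/s PCIe link.
    kv = kv_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128)
    kappa = critical_ratio(
        num_params=8e9, peak_flops=1e15, pcie_bytes_per_s=64e9, kv_bytes=kv
    )
    print(f"KV bytes per token: {kv}")
    print(f"Illustrative critical cached-to-prefill ratio: {kappa:.2f}")
```

Under these assumed numbers the threshold is in the single digits, so a request that reuses thousands of cached tokens while prefilling only a few dozen new ones would sit far above it. The paper's own framework and measured hardware parameters should be consulted for the actual values.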

Taxonomy

Topics: Parallel Computing and Optimization Techniques · Cloud Computing and Resource Management · Advanced Data Storage Technologies