CALVO: Improve Serving Efficiency for LLM Inferences with Intense Network Demands
Weiye Wang (1), Chen Chen (1), Junxue Zhang (2), Zhusheng Wang (3), Hui Yuan (3), Zixuan Guan (3), Xiaolong Zheng (3), Qizhen Weng (4), Yin Chen (4), Minyi Guo (1) ((1) Shanghai Jiao Tong University, (2) University of Science, Technology of China, (3) Huawei

TL;DR
CALVO is an LLM serving engine that treats KVCache loading as a first-class concern: it decouples loading from GPU computation and explicitly accounts for loading delay in scheduling, achieving up to 61.67% higher SLO attainment.
Contribution
CALVO manages KVCache loading as an independent, asynchronous stage rather than a subordinate phase of GPU execution, which improves utilization of network, PCIe, and compute resources and lets the scheduler account for loading delay explicitly.
Findings
Achieves up to 61.67% higher SLO attainment.
Effectively decouples KVCache loading from GPU computation.
Improves network-intensive LLM inference efficiency.
Abstract
Distributed prefix caching has become a core technique for efficient LLM serving. However, for long-context requests with high cache hit ratios, retrieving reusable KVCache blocks from remote servers has emerged as a new performance bottleneck. Such network-intensive LLM inference is expected to become increasingly common as agentic AI workloads continue to grow. However, existing LLM inference engines remain largely compute-centric: they treat KVCache loading as a subordinate phase to GPU execution and often fail to account for its delay explicitly during scheduling. We present CALVO, an LLM serving engine that treats KVCache loading as a first-class concern. CALVO decouples KVCache loading and GPU computation into independently managed, asynchronously progressing stages, enabling better utilization of network, PCIe, and computation resources. In addition, CALVO incorporates KVCache…
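The decoupling described in the abstract can be illustrated with a minimal sketch. This is not CALVO's implementation: the names `load_stage`, `compute_stage`, and `run_pipeline` are hypothetical, and `asyncio.sleep` stands in for both the remote KVCache fetch and the GPU work. The sketch only shows the general idea of two stages that progress independently and are connected by a queue, so that loading for a later request can overlap with computation for an earlier one.

```python
import asyncio


async def load_stage(requests, ready):
    """Stage 1 (hypothetical): fetch KVCache blocks for each request in turn."""
    for req_id, load_s, compute_s in requests:
        # Placeholder for the network + PCIe transfer of reusable KVCache blocks.
        await asyncio.sleep(load_s)
        await ready.put((req_id, compute_s))
    await ready.put(None)  # sentinel: no more requests will arrive


async def compute_stage(ready, done):
    """Stage 2 (hypothetical): run GPU work as soon as a request's cache is loaded."""
    while True:
        item = await ready.get()
        if item is None:
            break
        req_id, compute_s = item
        # Placeholder for GPU computation on the loaded request.
        await asyncio.sleep(compute_s)
        done.append(req_id)


async def run_pipeline(requests):
    """Run both stages concurrently; returns request ids in completion order."""
    ready = asyncio.Queue()
    done = []
    await asyncio.gather(load_stage(requests, ready), compute_stage(ready, done))
    return done


if __name__ == "__main__":
    # Each tuple is (request id, assumed load seconds, assumed compute seconds).
    demo = [("a", 0.02, 0.01), ("b", 0.01, 0.01), ("c", 0.0, 0.01)]
    print(asyncio.run(run_pipeline(demo)))
```

Because the loader never waits on the compute stage, the transfer for request "b" proceeds while "a" is still computing. A real engine would also need the scheduler to factor estimated loading delay into its ordering decisions, which this sketch does not attempt.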
Taxonomy
Topics: Caching and Content Delivery · Cloud Computing and Resource Management · Advanced Data Storage Technologies
