Adaptive Multi-Objective Tiered Storage Configuration for KV Cache in LLM Service
Xianzhe Zheng, Zhengheng Wang, Ruiyan Ma, Rui Wang, Xiyu Wang, Rui Chen, Peng Zhang, Sicheng Pan, Zhangheng Huang, Chenxin Wu, Yi Zhang, Bo Cai, Kan Liu, Teng Ma, Yin Du, Dong Deng, Sai Wu, Guoyun Zhu, Wei Zhang, Feifei Li

TL;DR
This paper presents Kareto, an optimizer for multi-tiered KV cache storage in LLM services that dynamically balances cost, throughput, and latency, improving performance and efficiency over static configurations.
Contribution
It introduces Kareto, a novel multi-objective optimizer that efficiently approximates the Pareto frontier for heterogeneous storage configurations in LLM caching.
Findings
Kareto effectively adapts to workload variations.
It improves throughput by up to 9.3%.
It reduces latency by up to 58.3%.
Abstract
The memory-for-computation paradigm of KV caching is essential for accelerating large language model (LLM) inference services, but limited GPU high-bandwidth memory (HBM) capacity motivates offloading the KV cache to cheaper external storage tiers. While this expands capacity, it introduces the challenge of dynamically managing heterogeneous storage resources to balance cost, throughput, and latency under varying workloads. We formulate this as a multi-objective optimization problem: identifying the Pareto frontier across these metrics within the storage configuration space. Using a high-fidelity end-to-end simulator, we observe that the objective functions are non-analytic and exhibit complex variable coupling, making the Pareto frontier difficult to approximate analytically. To obtain the frontier, we introduce Kareto, a KV-cache Adaptive REsource managemenT Optimizer. Kareto leverages…
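To make the Pareto-frontier formulation in the abstract concrete, here is a minimal sketch of what it means for one storage configuration to dominate another across cost, throughput, and latency. It is a brute-force non-dominated filter for illustration only, not Kareto's search method. The `Config` type, the configuration names, and all numbers are hypothetical; in the paper's setting the metric values would come from the end-to-end simulator.

```python
from typing import List, NamedTuple


class Config(NamedTuple):
    """One candidate storage configuration (hypothetical illustration type)."""
    name: str
    cost: float        # lower is better
    throughput: float  # higher is better
    latency: float     # lower is better


def dominates(a: Config, b: Config) -> bool:
    """True if a is no worse than b on every metric and strictly better on one."""
    no_worse = (
        a.cost <= b.cost
        and a.throughput >= b.throughput
        and a.latency <= b.latency
    )
    strictly_better = (
        a.cost < b.cost
        or a.throughput > b.throughput
        or a.latency < b.latency
    )
    return no_worse and strictly_better


def pareto_front(configs: List[Config]) -> List[Config]:
    """Keep only configurations that no other configuration dominates."""
    return [c for c in configs if not any(dominates(o, c) for o in configs)]


# Made-up metric values, chosen only to show one dominated point.
candidates = [
    Config("hbm_only", cost=1.0, throughput=100.0, latency=50.0),
    Config("hbm_dram", cost=1.5, throughput=130.0, latency=40.0),
    Config("hbm_dram_ssd", cost=1.2, throughput=120.0, latency=45.0),
    Config("oversized", cost=2.0, throughput=125.0, latency=42.0),
]

front = pareto_front(candidates)
front_names = [c.name for c in front]
print(front_names)
```

In this toy data, `oversized` is dropped because `hbm_dram` is cheaper, faster, and lower-latency at once. The other three remain because each trades one metric against another. This pairwise filter scales quadratically with the number of candidates, so it only illustrates the definition; it says nothing about how a large configuration space should be searched.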
Taxonomy
Topics: Parallel Computing and Optimization Techniques · Cloud Computing and Resource Management · Big Data and Digital Economy
