LayerKV: Optimizing Large Language Model Serving with Layer-wise KV Cache Management
Yi Xiong, Hao Wu, Changxu Shao, Ziqing Wang, Rui Zhang, Yuhong Guo, Junping Zhao, Ke Zhang, Zhenxuan Pan

TL;DR
LayerKV is a plug-in method that reduces Time to First Token (TTFT) latency in large language model serving through layer-wise KV cache management and SLO-aware scheduling, improving response times and SLO adherence without extra hardware.
Contribution
It introduces layer-wise KV cache management and an SLO-aware scheduler to optimize latency and resource utilization in LLM serving.
Findings
TTFT latency improved by up to 69x
SLO violation rates reduced by 28.7%
Effective across models from 7B to 70B parameters
Abstract
The expanding context windows in large language models (LLMs) have greatly enhanced their capabilities in various applications, but they also make it harder to maintain low latency, particularly in Time to First Token (TTFT). This paper identifies that the sharp rise in TTFT as context length increases is predominantly driven by queuing delays, which arise when growing demand for GPU Key-Value (KV) cache allocation collides with the limited availability of KV cache blocks. To address this issue, we propose LayerKV, a simple plug-in method that reduces TTFT without requiring additional hardware or compromising output performance, while integrating with existing parallelism strategies and scheduling techniques. Specifically, LayerKV introduces layer-wise KV block allocation, management, and offloading for fine-grained…
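To make the KV cache pressure described in the abstract concrete, the following is a minimal back-of-the-envelope sketch, not the paper's implementation. It uses the standard KV cache sizing rule (one key and one value vector per KV head, per layer, per token) and compares how many blocks must be free before prefill can start when every layer is reserved up front versus when only some layers need GPU-resident blocks. The names `ModelShape`, `request_level_blocks`, `layer_level_blocks`, and the `resident_layers` parameter are hypothetical, as is the block size of 16 tokens; the example shape (32 layers, 32 KV heads, head dimension 128, fp16) matches a Llama-2-7B-style model and is chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelShape:
    """Hypothetical container for the dimensions that determine KV cache size."""
    num_layers: int
    num_kv_heads: int
    head_dim: int
    bytes_per_element: int = 2  # fp16 / bf16


def kv_bytes_per_token_per_layer(m: ModelShape) -> int:
    # One key vector and one value vector per KV head.
    return 2 * m.num_kv_heads * m.head_dim * m.bytes_per_element


def kv_bytes_per_token(m: ModelShape) -> int:
    # Every layer keeps its own keys and values for each token.
    return m.num_layers * kv_bytes_per_token_per_layer(m)


def blocks_needed(num_tokens: int, block_size: int) -> int:
    # Ceiling division: a partially filled block still occupies a whole block.
    return -(-num_tokens // block_size)


def request_level_blocks(m: ModelShape, prompt_tokens: int, block_size: int = 16) -> int:
    """Per-layer blocks that must be free if all layers are reserved up front."""
    return m.num_layers * blocks_needed(prompt_tokens, block_size)


def layer_level_blocks(m: ModelShape, prompt_tokens: int,
                       resident_layers: int, block_size: int = 16) -> int:
    """Per-layer blocks that must be free if only `resident_layers` layers
    need GPU-resident KV at once (an illustrative assumption, not the
    paper's actual policy)."""
    layers = min(resident_layers, m.num_layers)
    return layers * blocks_needed(prompt_tokens, block_size)


if __name__ == "__main__":
    shape = ModelShape(num_layers=32, num_kv_heads=32, head_dim=128)
    print(kv_bytes_per_token(shape))            # bytes of KV cache per token
    print(request_level_blocks(shape, 4096))    # all layers reserved up front
    print(layer_level_blocks(shape, 4096, 4))   # only 4 layers resident
```

Under these assumptions, a request with a long prompt can wait in the queue simply because the full all-layer reservation is not available, even though a smaller layer-granular reservation would fit; this is the kind of queuing delay the abstract attributes the TTFT rise to.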
Taxonomy
Topics: Algorithms and Data Compression · Network Packet Processing and Optimization
