LayerKV: Optimizing Large Language Model Serving with Layer-wise KV Cache Management

Yi Xiong; Hao Wu; Changxu Shao; Ziqing Wang; Rui Zhang; Yuhong Guo; Junping Zhao; Ke Zhang; Zhenxuan Pan

arXiv:2410.00428·cs.DC·October 10, 2024

TL;DR

LayerKV is a plug-in method that reduces Time to First Token (TTFT) in large language model serving through layer-wise KV cache management and SLO-aware scheduling, improving SLO adherence without extra hardware.

Contribution

It introduces layer-wise KV cache management and an SLO-aware scheduler to optimize latency and resource utilization in LLM serving.

Findings

01. TTFT latency improved up to 69x
02. SLO violation rates reduced by 28.7%
03. Effective across models from 7B to 70B parameters

Abstract

The expanding context windows in large language models (LLMs) have greatly enhanced their capabilities in various applications, but they also introduce significant challenges in maintaining low latency, particularly in Time to First Token (TTFT). This paper identifies that the sharp rise in TTFT as context length increases is predominantly driven by queuing delays, which are caused by the growing demands for GPU Key-Value (KV) cache allocation clashing with the limited availability of KV cache blocks. To address this issue, we propose LayerKV, a simple yet effective plug-in method that effectively reduces TTFT without requiring additional hardware or compromising output performance, while seamlessly integrating with existing parallelism strategies and scheduling techniques. Specifically, LayerKV introduces layer-wise KV block allocation, management, and offloading for fine-grained…
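The link between context length and KV cache block demand described in the abstract can be illustrated with a little arithmetic. The sketch below is not LayerKV's implementation. It assumes a hypothetical 7B-class configuration (32 layers, 32 KV heads, head dimension 128, fp16), a hypothetical block granularity of 16 tokens per layer, and a made-up `resident_layers` parameter standing in for however many layers' KV blocks are kept on the GPU at admission time.

```python
import math


def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes=2):
    """Bytes of KV cache one token occupies across all layers.

    The factor 2 accounts for storing both the key and the value tensor.
    """
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def gpu_blocks_needed(prompt_tokens, num_layers, resident_layers, block_tokens=16):
    """GPU blocks a prompt needs when only `resident_layers` layers stay on the GPU.

    Assumption (illustrative only): one block holds `block_tokens` tokens of KV
    for a single layer, and the remaining layers' blocks live elsewhere.
    """
    blocks_per_layer = math.ceil(prompt_tokens / block_tokens)
    return blocks_per_layer * min(resident_layers, num_layers)


# Hypothetical 7B-class model in fp16.
per_token = kv_bytes_per_token(num_layers=32, num_kv_heads=32, head_dim=128)

# A 4096-token prompt with every layer resident vs. only 8 layers resident.
all_layers = gpu_blocks_needed(4096, num_layers=32, resident_layers=32)
some_layers = gpu_blocks_needed(4096, num_layers=32, resident_layers=8)

print(per_token, all_layers, some_layers)
```

Under these assumptions, a request that reserves blocks for every layer up front needs four times as many GPU blocks as one that keeps only a quarter of its layers resident. A long prompt is therefore more likely to wait in the queue for free blocks when allocation is per request rather than per layer.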

Code & Models

Repositories

intelligent-machine-learning/glake
pytorch

Taxonomy

Topics: Algorithms and Data Compression · Network Packet Processing and Optimization