CacheGen: KV Cache Compression and Streaming for Fast Large Language Model Serving

Yuhan Liu; Hanchen Li; Yihua Cheng; Siddhant Ray; Yuyang Huang; Qizheng Zhang; Kuntai Du; Jiayi Yao; Shan Lu; Ganesh Ananthanarayanan; Michael Maire; Henry Hoffmann; Ari Holtzman; Junchen Jiang

arXiv:2310.07240·cs.NI·July 23, 2024

TL;DR

CacheGen is a novel system that compresses and streams KV caches for large language models, significantly reducing bandwidth and delay during context loading without sacrificing response quality.

Contribution

CacheGen introduces a custom tensor encoder and adaptive compression for KV caches, enabling faster and more bandwidth-efficient large language model serving.

Findings

01 Reduces KV cache size by 3.5-4.3x

02 Decreases total context-fetching delay by 3.2-3.7x

03 Maintains high generation quality
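To give a sense of scale for the size reduction, here is a rough back-of-the-envelope sketch in Python. The model shape (32 layers, 32 KV heads, head dimension 128, fp16 values), the 8,000-token context, and the 10 Gbps link are all hypothetical values chosen for illustration, not figures from the paper. The estimate covers only raw network transfer time and ignores decoding and any other overhead, so it should not be read as reproducing the reported delay numbers.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, num_tokens, bytes_per_value=2):
    """Uncompressed KV cache size for one sequence.

    The leading factor of 2 accounts for one key tensor and one value
    tensor per layer. bytes_per_value=2 assumes fp16 storage.
    """
    return 2 * num_layers * num_kv_heads * head_dim * num_tokens * bytes_per_value


def transfer_seconds(size_bytes, bandwidth_gbps):
    """Ideal transfer time over a link of the given bandwidth (no overhead)."""
    return size_bytes * 8 / (bandwidth_gbps * 1e9)


# Hypothetical configuration, for illustration only.
raw = kv_cache_bytes(num_layers=32, num_kv_heads=32, head_dim=128, num_tokens=8000)

# 1.0 = uncompressed; 3.5 and 4.3 are the ends of the reported size-reduction range.
for ratio in (1.0, 3.5, 4.3):
    size = raw / ratio
    print(f"ratio {ratio:>3}: {size / 1e9:.2f} GB, "
          f"{transfer_seconds(size, bandwidth_gbps=10):.2f} s at 10 Gbps")
```

Under these assumptions the uncompressed cache is about 4.19 GB, which is why fetching it over the network can dominate the time saved by skipping context processing.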

Abstract

As large language models (LLMs) take on complex tasks, their inputs are supplemented with longer contexts that incorporate domain knowledge. Yet using long contexts is challenging, as nothing can be generated until the whole context is processed by the LLM. While the context-processing delay can be reduced by reusing the KV cache of a context across different inputs, fetching the KV cache, which contains large tensors, over the network can cause high extra network delays. CacheGen is a fast context-loading module for LLM systems. First, CacheGen uses a custom tensor encoder, leveraging KV cache's distributional properties to encode a KV cache into more compact bitstream representations with negligible decoding overhead, to save bandwidth usage. Second, CacheGen adapts the compression level of different parts of a KV cache to cope with changes in available bandwidth, in order to…
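The second idea in the abstract, adapting the compression level of different parts of a KV cache to the available bandwidth, can be illustrated with a minimal Python sketch. This is not CacheGen's actual algorithm: the `Chunk` type, the `pick_level` and `stream` helpers, the even split of the remaining time budget, and the delay deadline itself are all assumptions made for illustration. Each chunk is assumed to be pre-encoded at several levels, and the sender picks the least compressed level that still fits the chunk's share of the remaining time.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Chunk:
    # Hypothetical: encoded size in bytes at each compression level.
    # Index 0 is the least compressed; higher indices are smaller.
    sizes_by_level: List[int]


def pick_level(chunk: Chunk, bandwidth_bytes_per_s: float, time_budget_s: float) -> int:
    """Return the least compressed level whose transfer fits the budget.

    Falls back to the most compressed level if none fits.
    """
    for level, size in enumerate(chunk.sizes_by_level):
        if size / bandwidth_bytes_per_s <= time_budget_s:
            return level
    return len(chunk.sizes_by_level) - 1


def stream(chunks: List[Chunk], bandwidth_trace: List[float],
           deadline_s: float) -> Tuple[List[int], float]:
    """Choose a level per chunk given the bandwidth observed for that chunk.

    The remaining time is split evenly across the remaining chunks
    (an assumption of this sketch). Returns the levels and total time.
    """
    elapsed = 0.0
    levels = []
    for i, (chunk, bw) in enumerate(zip(chunks, bandwidth_trace)):
        budget = max(deadline_s - elapsed, 0.0) / (len(chunks) - i)
        level = pick_level(chunk, bw, budget)
        elapsed += chunk.sizes_by_level[level] / bw
        levels.append(level)
    return levels, elapsed


# Bandwidth drops for the middle chunk, so it is sent at a higher compression level.
chunks = [Chunk([400, 200, 100]) for _ in range(3)]
levels, total = stream(chunks, bandwidth_trace=[400.0, 100.0, 400.0], deadline_s=3.0)
print(levels, total)
```

In this toy trace the first and last chunks go out at level 0, while the middle chunk, sent during the bandwidth dip, drops to level 2 so that the overall deadline is still met.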

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Speech Recognition and Synthesis