CacheGen: KV Cache Compression and Streaming for Fast Large Language Model Serving
Yuhan Liu, Hanchen Li, Yihua Cheng, Siddhant Ray, Yuyang Huang, Qizheng Zhang, Kuntai Du, Jiayi Yao, Shan Lu, Ganesh Ananthanarayanan, Michael Maire, Henry Hoffmann, Ari Holtzman, Junchen Jiang

TL;DR
CacheGen is a novel system that compresses and streams KV caches for large language models, significantly reducing bandwidth and delay during context loading without sacrificing response quality.
Contribution
CacheGen introduces a custom tensor encoder and adaptive compression for KV caches, enabling faster and more bandwidth-efficient large language model serving.
Findings
Reduces KV cache size by 3.5-4.3x
Decreases total context-fetching delay by 3.2-3.7x
Maintains high generation quality
Abstract
As large language models (LLMs) take on complex tasks, their inputs are supplemented with longer contexts that incorporate domain knowledge. Yet using long contexts is challenging, as nothing can be generated until the whole context is processed by the LLM. While the context-processing delay can be reduced by reusing the KV cache of a context across different inputs, fetching the KV cache, which contains large tensors, over the network can cause high extra network delays. CacheGen is a fast context-loading module for LLM systems. First, CacheGen uses a custom tensor encoder, leveraging KV cache's distributional properties to encode a KV cache into more compact bitstream representations with negligible decoding overhead, to save bandwidth usage. Second, CacheGen adapts the compression level of different parts of a KV cache to cope with changes in available bandwidth, in order to…
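The bandwidth adaptation described in the abstract can be illustrated with a minimal sketch. This is not CacheGen's actual algorithm or API: the level names, the per-chunk encoded sizes, the deadline, and the helpers `pick_level` and `stream` are all hypothetical, chosen only to show the idea of picking a compression level per KV-cache chunk as measured bandwidth changes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Level:
    name: str
    bytes_per_chunk: int  # hypothetical encoded size of one KV-cache chunk


# Hypothetical compression levels, ordered from highest quality (largest)
# to lowest quality (smallest). Real sizes depend on the model and encoder.
LEVELS = [
    Level("high", 40_000_000),
    Level("medium", 20_000_000),
    Level("low", 10_000_000),
]


def pick_level(bandwidth_bytes_per_s, time_left_s, chunks_left, levels=LEVELS):
    """Return the highest-quality level whose remaining chunks fit the budget."""
    for level in levels:
        needed_s = level.bytes_per_chunk * chunks_left / bandwidth_bytes_per_s
        if needed_s <= time_left_s:
            return level
    return levels[-1]  # nothing fits: fall back to the smallest encoding


def stream(bandwidth_trace, deadline_s, levels=LEVELS):
    """Simulate sending one chunk per bandwidth sample (assumed setup)."""
    elapsed = 0.0
    chosen = []
    total = len(bandwidth_trace)
    for i, bandwidth in enumerate(bandwidth_trace):
        level = pick_level(bandwidth, deadline_s - elapsed, total - i, levels)
        elapsed += level.bytes_per_chunk / bandwidth  # transfer time of chunk
        chosen.append(level.name)
    return chosen, elapsed


if __name__ == "__main__":
    # Bandwidth drops for the third chunk, then recovers.
    levels_used, total_s = stream([100e6, 100e6, 25e6, 100e6], deadline_s=2.0)
    print(levels_used, round(total_s, 2))
```

In this toy trace the sender drops to a smaller encoding only while bandwidth is low and returns to the higher-quality level once it recovers, which is the kind of per-chunk trade-off between delay and quality that the abstract points to.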
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Speech Recognition and Synthesis
