InfiniGen: Efficient Generative Inference of Large Language Models with Dynamic KV Cache Management
Wonbeom Lee, Jungi Lee, Junghwan Seo, Jaewoong Sim

TL;DR
InfiniGen is a dynamic KV cache management framework for large language models that reduces KV cache memory overhead and improves inference efficiency during long-text generation by prefetching only the essential cache entries.
Contribution
The paper proposes a novel KV cache management method that enhances offloading-based LLM inference systems by selectively prefetching critical cache entries, improving performance and accuracy.
Findings
Up to 3.00x higher overall performance than prior KV cache management methods
Substantially better model accuracy than prior KV cache management methods
Effective handling of long-text generation in LLMs
Abstract
Transformer-based large language models (LLMs) demonstrate impressive performance across various natural language processing tasks. Serving LLM inference for generating long contents, however, poses a challenge due to the enormous memory footprint of the transient state, known as the key-value (KV) cache, which scales with the sequence length and batch size. In this paper, we present InfiniGen, a novel KV cache management framework tailored for long-text generation, which synergistically works with modern offloading-based inference systems. InfiniGen leverages the key insight that a few important tokens that are essential for computing the subsequent attention layer in the Transformer can be speculated by performing a minimal rehearsal with the inputs of the current layer and part of the query weight and key cache of the subsequent layer. This allows us to prefetch only the essential KV…
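The speculation step described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`speculate_important_tokens`, `prefetch`), the `col_idx` argument that picks which query-weight and key-cache columns form the "part" used for the rehearsal, and the simple top-k selection rule are all hypothetical choices made for this example. The excerpt above does not say how the partial weights are chosen or how many tokens are kept.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]
Matrix = Sequence[Sequence[float]]


def speculate_important_tokens(
    x: Vector,               # input of the current layer for the token being generated
    wq_next: Matrix,         # query weight of the subsequent layer, shape (d, d)
    key_cache_next: Matrix,  # key cache of the subsequent layer, shape (n_tokens, d)
    col_idx: Sequence[int],  # ASSUMPTION: columns that make up the "partial" weight/cache
    k: int,                  # ASSUMPTION: keep the k highest-scoring tokens
) -> List[int]:
    """Rehearse the next layer's attention cheaply and return likely-important token ids."""
    # Partial query: project x through only the selected columns of the query weight.
    q_partial = [
        sum(x[i] * wq_next[i][c] for i in range(len(x))) for c in col_idx
    ]
    # Approximate attention score per cached token, using the same columns of each key.
    approx_scores = [
        sum(q * key[c] for q, c in zip(q_partial, col_idx)) for key in key_cache_next
    ]
    # Rank tokens by approximate score and keep the top k (returned in position order).
    ranked = sorted(range(len(approx_scores)), key=lambda t: approx_scores[t], reverse=True)
    return sorted(ranked[:k])


def prefetch(
    host_keys: Matrix, host_values: Matrix, token_ids: Sequence[int]
) -> Tuple[List[Sequence[float]], List[Sequence[float]]]:
    """Gather only the selected KV entries from the (hypothetical) host-side cache."""
    return [host_keys[t] for t in token_ids], [host_values[t] for t in token_ids]


# Toy usage: 6 cached tokens, hidden size 4, rehearsal uses columns 0 and 1 only.
x = [1.0, 1.0, 1.0, 1.0]
wq = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
keys = [[0.0] * 4 for _ in range(6)]
keys[2][0] = 5.0
keys[4][1] = 3.0
values = [[float(t)] * 4 for t in range(6)]

important = speculate_important_tokens(x, wq, keys, col_idx=[0, 1], k=2)
fetched_keys, fetched_values = prefetch(keys, values, important)
```

The point of the sketch is the cost asymmetry: the rehearsal touches only `len(col_idx)` columns per token, while the full KV entries are transferred only for the tokens it selects.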
Taxonomy
Topics: Topic Modeling · Algorithms and Data Compression · Natural Language Processing Techniques
Methods: Attention Is All You Need · Softmax · Layer Normalization · Absolute Position Encodings · Byte Pair Encoding · Label Smoothing · Position-Wise Feed-Forward Layer · Dropout · Adam · Linear Layer
