InfiniGen: Efficient Generative Inference of Large Language Models with Dynamic KV Cache Management

Wonbeom Lee; Jungi Lee; Junghwan Seo; Jaewoong Sim

arXiv:2406.19707·cs.LG·July 1, 2024·1 cites

TL;DR

InfiniGen introduces a dynamic KV cache management framework for large language models that reduces memory overhead and improves inference efficiency during long-text generation by selectively prefetching essential cache entries.

Contribution

The paper proposes a novel KV cache management method that enhances offloading-based LLM inference systems by selectively prefetching critical cache entries, improving performance and accuracy.

Findings

01. Up to 3.00x performance improvement over prior methods

02. Better model accuracy with efficient cache management

03. Effective handling of long-text generation in LLMs

Abstract

Transformer-based large language models (LLMs) demonstrate impressive performance across various natural language processing tasks. Serving LLM inference for generating long contents, however, poses a challenge due to the enormous memory footprint of the transient state, known as the key-value (KV) cache, which scales with the sequence length and batch size. In this paper, we present InfiniGen, a novel KV cache management framework tailored for long-text generation, which synergistically works with modern offloading-based inference systems. InfiniGen leverages the key insight that a few important tokens that are essential for computing the subsequent attention layer in the Transformer can be speculated by performing a minimal rehearsal with the inputs of the current layer and part of the query weight and key cache of the subsequent layer. This allows us to prefetch only the essential KV…
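The speculation step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the choice of a fixed column subset as the "part" of the query weight and key cache, and the top-k prefetch budget are all assumptions made for illustration. The idea shown is only that the current layer's input, pushed through a slice of the next layer's query weight and compared against the same slice of the next layer's key cache, gives approximate attention scores from which a few tokens can be picked for prefetching.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def partial_projection(x, weight, cols):
    """Project x through only the selected columns of a d x d weight matrix."""
    return [sum(x[i] * weight[i][c] for i in range(len(x))) for c in cols]


def speculate_important_tokens(layer_input, next_query_weight, next_key_cache, cols, k):
    """Return indices of the k cached tokens with the highest approximate score.

    layer_input:       hidden vector entering the current layer (length d)
    next_query_weight: d x d query weight of the subsequent layer
    next_key_cache:    one length-d key vector per cached token (subsequent layer)
    cols:              column subset used for the rehearsal (hypothetical choice)
    k:                 number of tokens to prefetch (hypothetical budget)
    """
    # Approximate the next layer's query using only part of its weight.
    partial_query = partial_projection(layer_input, next_query_weight, cols)
    scores = []
    for token_idx, key in enumerate(next_key_cache):
        # Use the matching slice of each cached key.
        partial_key = [key[c] for c in cols]
        scores.append((dot(partial_query, partial_key), token_idx))
    # Keep the k highest-scoring tokens; only their KV entries would be fetched.
    top = sorted(scores, reverse=True)[:k]
    return sorted(idx for _, idx in top)


# Toy example: 4-dimensional hidden state, identity query weight, 5 cached tokens.
layer_input = [1.0, 0.0, 2.0, 0.0]
identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
key_cache = [
    [0.1, 0.0, 0.1, 0.0],
    [1.0, 0.0, 1.0, 0.0],
    [-1.0, 0.0, 0.0, 0.0],
    [0.0, 5.0, 2.0, 5.0],
    [0.5, 0.0, 0.0, 0.0],
]
selected = speculate_important_tokens(layer_input, identity, key_cache, cols=[0, 2], k=2)
print(selected)  # prints [1, 3]
```

In an offloading-based system, the returned indices would determine which KV cache entries are copied from host memory to the GPU before the next layer's attention runs; the rest stay offloaded. How the partial weights are chosen and how many tokens are kept are design decisions this sketch leaves open.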

Code & Models

Repositories

snu-comparch/infinigen (PyTorch)

Taxonomy

Topics: Topic Modeling · Algorithms and Data Compression · Natural Language Processing Techniques

Methods: Attention Is All You Need · Softmax · Layer Normalization · Absolute Position Encodings · Byte Pair Encoding · Label Smoothing · Position-Wise Feed-Forward Layer · Dropout · Adam · Linear Layer