KV-Runahead: Scalable Causal LLM Inference by Parallel Key-Value Cache Generation

Minsik Cho; Mohammad Rastegari; Devang Naik

arXiv:2405.05329·cs.DC·May 15, 2024

Open Access

TL;DR

KV-Runahead is a parallelization method that accelerates the prompt phase of large language model inference by having multiple processes populate the key-value cache, reducing time-to-first-token (TTFT).

Contribution

The paper introduces KV-Runahead, a scalable parallelization scheme that accelerates LLM prompt inference by parallel cache generation and load balancing, with easy implementation and reduced computation.

Findings

1. Over 1.4x speedup for Llama 7B

2. Over 1.6x speedup for Falcon 7B

3. Efficient parallel cache population reduces inference latency

Abstract

Large language model (LLM) inference has two phases: the prompt (or prefill) phase, which outputs the first token, and the extension (or decoding) phase, which generates subsequent tokens. In this work, we propose an efficient parallelization scheme, KV-Runahead, to accelerate the prompt phase. The key observation is that the extension phase generates tokens faster than the prompt phase because of the key-value cache (KV-cache). Hence, KV-Runahead parallelizes the prompt phase by orchestrating multiple processes to populate the KV-cache, minimizing the time-to-first-token (TTFT). Dual-purposing the KV-cache scheme has two main benefits. First, since the KV-cache is designed to leverage the causal attention map, we minimize computation automatically. Second, since it already exists for the extension phase, KV-Runahead is easy to implement. We further propose context-level…
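The idea of populating the KV-cache chunk by chunk can be illustrated with a toy example. The sketch below is not the authors' implementation: it uses a single attention head with no learned projections, simulates the cooperating processes sequentially inside one Python process, and all function names (`causal_attention`, `chained_prefill`) are hypothetical. It only shows that when each worker handles one chunk of the prompt and receives the cached keys and values of all earlier chunks, the result matches ordinary causal attention over the whole prompt.

```python
import math
import random


def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]


def attend(q, keys, values):
    # Scaled dot-product attention for one query vector.
    d = len(q)
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
    w = softmax(scores)
    return [sum(wi * v[j] for wi, v in zip(w, values))
            for j in range(len(values[0]))]


def causal_attention(qs, ks, vs, past_k=(), past_v=()):
    # Causal attention for a chunk of tokens. past_k / past_v hold the
    # cached keys and values of every token that precedes the chunk.
    keys = list(past_k) + list(ks)
    values = list(past_v) + list(vs)
    offset = len(past_k)
    out = []
    for i, q in enumerate(qs):
        n = offset + i + 1  # token i sees the cache plus itself and earlier chunk tokens
        out.append(attend(q, keys[:n], values[:n]))
    return out, keys, values


def chained_prefill(qs, ks, vs, num_workers):
    # Assumption: workers are simulated one after another. Each worker
    # takes one contiguous chunk, extends the cache, and hands it on.
    n = len(qs)
    bounds = [i * n // num_workers for i in range(num_workers + 1)]
    cache_k, cache_v, outputs = [], [], []
    for w in range(num_workers):
        lo, hi = bounds[w], bounds[w + 1]
        out, cache_k, cache_v = causal_attention(
            qs[lo:hi], ks[lo:hi], vs[lo:hi], cache_k, cache_v)
        outputs.extend(out)
    return outputs, cache_k, cache_v


def random_matrix(rows, cols):
    return [[random.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


random.seed(0)
n, d = 8, 4
qs, ks, vs = random_matrix(n, d), random_matrix(n, d), random_matrix(n, d)

full, _, _ = causal_attention(qs, ks, vs)
chunked, cache_k, cache_v = chained_prefill(qs, ks, vs, num_workers=3)

# Largest element-wise gap between the one-shot and the chunked result.
max_diff = max(abs(a - b)
               for row_f, row_c in zip(full, chunked)
               for a, b in zip(row_f, row_c))
```

After the last worker finishes, `cache_k` and `cache_v` cover the whole prompt, which is the state the extension phase needs in order to generate the next token. Because of the causal mask, a worker never needs keys or values from later chunks, so the cache only has to flow in one direction along the chain.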

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Data Quality and Management

Methods: LLaMA