KV-Runahead: Scalable Causal LLM Inference by Parallel Key-Value Cache Generation
Minsik Cho, Mohammad Rastegari, Devang Naik

TL;DR
KV-Runahead is a parallelization method that accelerates the prompt phase of large language model inference by populating the key-value cache across multiple processes, reducing time-to-first-token (TTFT).
Contribution
The paper introduces KV-Runahead, a scalable parallelization scheme that accelerates LLM prompt inference through parallel KV-cache generation and load balancing. Because it reuses the KV-cache that already exists for the extension phase, it is easy to implement and reduces computation.
Findings
Over 1.4x prompt-phase speedup for Llama 7B
Over 1.6x prompt-phase speedup for Falcon 7B
Parallel KV-cache population reduces time-to-first-token (TTFT)
Abstract
Large language model (LLM) inference has two phases: the prompt (or prefill) phase, which outputs the first token, and the extension (or decoding) phase, which generates subsequent tokens. In this work, we propose an efficient parallelization scheme, KV-Runahead, to accelerate the prompt phase. The key observation is that the extension phase generates tokens faster than the prompt phase because of the key-value cache (KV-cache). Hence, KV-Runahead parallelizes the prompt phase by orchestrating multiple processes to populate the KV-cache, minimizing the time-to-first-token (TTFT). Dual-purposing the KV-cache scheme has two main benefits. First, since the KV-cache is designed to leverage the causal attention map, we minimize computation and communication automatically. Second, since it already exists for the extension phase, KV-Runahead is easy to implement. We further propose context-level…
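The idea in the abstract, several processes filling the KV-cache in sequence so that only the last one has to produce the first token, can be illustrated with a small single-process simulation. This is a minimal sketch under stated assumptions, not the authors' implementation: it models one attention layer with one head in pure Python, the "processes" are loop iterations, and every name in it (`causal_attention_chunk`, `runahead_prefill`, `attention_pairs`) is hypothetical. It shows that handing the accumulated KV-cache from one chunk of the context to the next gives the same last-token output as processing the whole context at once.

```python
import math
import random

def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def attend(q, keys, values):
    """Scaled dot-product softmax attention of one query over keys/values."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(len(values[0]))]

def causal_attention_chunk(chunk, cache_k, cache_v, Wq, Wk, Wv):
    """Process one chunk given the KV-cache of all earlier tokens.

    Each token attends to the cached keys/values plus the keys/values of the
    tokens up to and including itself, which is exactly causal attention.
    """
    keys, values = list(cache_k), list(cache_v)
    outputs = []
    for x in chunk:
        keys.append(matvec(Wk, x))
        values.append(matvec(Wv, x))
        outputs.append(attend(matvec(Wq, x), keys, values))
    return outputs, keys, values

def runahead_prefill(tokens, sizes, Wq, Wk, Wv):
    """Simulate a chain of workers, one per chunk (hypothetical helper).

    In a real multi-process setup each chunk would sit on its own worker and
    the cache would be sent to the next worker; here it is a plain hand-off.
    """
    assert sum(sizes) == len(tokens)
    cache_k, cache_v, outputs, start = [], [], [], 0
    for size in sizes:
        chunk = tokens[start:start + size]
        start += size
        outputs, cache_k, cache_v = causal_attention_chunk(
            chunk, cache_k, cache_v, Wq, Wk, Wv)
    # The last worker's final output is what would feed first-token sampling.
    return outputs[-1], cache_k, cache_v

def attention_pairs(sizes):
    """Query-key pairs each worker evaluates: a rough proxy for its load."""
    pairs, offset = [], 0
    for c in sizes:
        pairs.append(c * offset + c * (c + 1) // 2)
        offset += c
    return pairs

random.seed(0)
d, n = 4, 8
rand_matrix = lambda: [[random.uniform(-1, 1) for _ in range(d)] for _ in range(d)]
Wq, Wk, Wv = rand_matrix(), rand_matrix(), rand_matrix()
tokens = [[random.uniform(-1, 1) for _ in range(d)] for _ in range(n)]

single, _, _ = runahead_prefill(tokens, [n], Wq, Wk, Wv)          # one worker
chained, k, v = runahead_prefill(tokens, [4, 3, 1], Wq, Wk, Wv)   # three workers
print(all(abs(a - b) < 1e-9 for a, b in zip(single, chained)))
print(attention_pairs([4, 4]))  # -> [10, 26]
```

The pair counts for an even split show that later workers attend over a longer cache and therefore do more attention work, which is why an even partition of the context leaves the workers unevenly loaded. The pair count is only an illustrative proxy, not a cost model taken from the paper.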
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Data Quality and Management
Methods: LLaMA
