Hydragen: High-Throughput LLM Inference with Shared Prefixes
Jordan Juravsky, Bradley Brown, Ryan Ehrlich, Daniel Y. Fu, Christopher Ré, Azalia Mirhoseini

TL;DR
Hydragen introduces a hardware-aware, exact attention implementation that efficiently handles shared prefixes in large language model inference, significantly boosting throughput and enabling longer contexts.
Contribution
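Hydragen decomposes attention into a shared-prefix part and a unique-suffix part. Queries from all sequences in the batch are grouped for the prefix part, which cuts redundant memory reads and replaces per-sequence matrix-vector products with hardware-friendly matrix multiplications.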
Findings
Up to 32x throughput improvement over baselines.
Maintains high throughput with longer shared contexts, e.g., 16K tokens.
Reduces inference time on programming tasks by 55%.
Abstract
Transformer-based large language models (LLMs) are now deployed to hundreds of millions of users. LLM inference is commonly performed on batches of sequences that share a prefix, such as few-shot examples or a chatbot system prompt. Decoding in this large-batch setting can be bottlenecked by the attention operation, which reads large key-value (KV) caches from memory and computes inefficient matrix-vector products for every sequence in the batch. In this work, we introduce Hydragen, a hardware-aware exact implementation of attention with shared prefixes. Hydragen computes attention over the shared prefix and unique suffixes separately. This decomposition enables efficient prefix attention by batching queries together across sequences, reducing redundant memory reads and enabling the use of hardware-friendly matrix multiplications. Our method can improve end-to-end CodeLlama-13b…
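The decomposition described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: it assumes a single attention head, one decoding query per sequence, and the standard log-sum-exp rule for merging softmax attention computed over disjoint sets of keys. All function names (`attention_with_lse`, `combine`, `shared_prefix_decode`, `reference_decode`) are hypothetical and chosen for illustration only.

```python
import numpy as np

def attention_with_lse(q, k, v):
    """Softmax attention plus the log-sum-exp (LSE) of the scores per query.

    q: (n, d), k: (m, d), v: (m, d) -> out: (n, d), lse: (n,)
    """
    scores = q @ k.T / np.sqrt(q.shape[-1])
    row_max = scores.max(axis=-1, keepdims=True)   # subtract max for stability
    weights = np.exp(scores - row_max)
    denom = weights.sum(axis=-1, keepdims=True)
    out = (weights / denom) @ v
    lse = (row_max + np.log(denom)).squeeze(-1)
    return out, lse

def combine(out_a, lse_a, out_b, lse_b):
    """Merge two attention results computed over disjoint key sets.

    Each partial output is reweighted by its share of the total softmax mass.
    """
    m = np.maximum(lse_a, lse_b)
    w_a = np.exp(lse_a - m)[:, None]
    w_b = np.exp(lse_b - m)[:, None]
    return (w_a * out_a + w_b * out_b) / (w_a + w_b)

def shared_prefix_decode(q, prefix_k, prefix_v, suffix_k, suffix_v):
    """Attention for a batch whose sequences share one prefix.

    q: (b, d), one query per sequence.
    prefix_k, prefix_v: (p, d), stored once for the whole batch.
    suffix_k, suffix_v: (b, s, d), unique to each sequence.
    """
    # Prefix part: all b queries hit the same keys, so this is a single
    # matrix-matrix product instead of b matrix-vector products.
    out_p, lse_p = attention_with_lse(q, prefix_k, prefix_v)

    # Suffix part: each query attends only to its own sequence's suffix.
    out_s = np.empty_like(q)
    lse_s = np.empty(q.shape[0])
    for i in range(q.shape[0]):
        o, l = attention_with_lse(q[i:i + 1], suffix_k[i], suffix_v[i])
        out_s[i], lse_s[i] = o[0], l[0]

    return combine(out_p, lse_p, out_s, lse_s)

def reference_decode(q, prefix_k, prefix_v, suffix_k, suffix_v):
    """Baseline: per-sequence attention over the concatenated prefix + suffix."""
    outs = []
    for i in range(q.shape[0]):
        k = np.concatenate([prefix_k, suffix_k[i]], axis=0)
        v = np.concatenate([prefix_v, suffix_v[i]], axis=0)
        o, _ = attention_with_lse(q[i:i + 1], k, v)
        outs.append(o[0])
    return np.stack(outs)
```

Under these assumptions the two functions agree up to floating-point error, which is the sense in which the decomposed computation is exact rather than approximate. The baseline touches the prefix keys and values once per sequence, while the decomposed version touches them once per batch.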
Taxonomy
Topics: Natural Language Processing Techniques · Mathematics, Computing, and Information Processing
