Hydragen: High-Throughput LLM Inference with Shared Prefixes

Jordan Juravsky; Bradley Brown; Ryan Ehrlich; Daniel Y. Fu; Christopher Ré; Azalia Mirhoseini

arXiv:2402.05099·cs.LG·May 14, 2024·1 cites

Open Access · 1 Repo

TL;DR

Hydragen is a hardware-aware, exact implementation of attention for batches of sequences that share a prefix. It raises LLM inference throughput by up to 32x over baselines and keeps throughput high as the shared context grows, e.g., to 16K tokens.

Contribution

Hydragen presents a novel decomposition of attention for shared prefixes, enabling hardware-efficient batching and substantial speedups in LLM inference.

Findings

01 Up to 32x throughput improvement over baselines.

02 Maintains high throughput with longer shared contexts, e.g., 16K tokens.

03 Reduces inference time on programming tasks by 55%.

Abstract

Transformer-based large language models (LLMs) are now deployed to hundreds of millions of users. LLM inference is commonly performed on batches of sequences that share a prefix, such as few-shot examples or a chatbot system prompt. Decoding in this large-batch setting can be bottlenecked by the attention operation, which reads large key-value (KV) caches from memory and computes inefficient matrix-vector products for every sequence in the batch. In this work, we introduce Hydragen, a hardware-aware exact implementation of attention with shared prefixes. Hydragen computes attention over the shared prefix and unique suffixes separately. This decomposition enables efficient prefix attention by batching queries together across sequences, reducing redundant memory reads and enabling the use of hardware-friendly matrix multiplications. Our method can improve end-to-end CodeLlama-13b…
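The prefix/suffix decomposition described in the abstract can be illustrated with a small sketch. The abstract does not say how the two partial results are merged, so this sketch assumes the standard log-sum-exp identity for combining softmax attention over disjoint key sets. All function names here are hypothetical and are not taken from the Hydragen codebase. The sketch uses plain Python lists and loops over queries for readability; in an actual kernel the prefix step would run once for the whole batch as a single matrix-matrix product, which is where the savings come from.

```python
import math
import random


def attn_with_lse(q, keys, values):
    """Softmax attention for one query over one key/value set.

    Returns the output vector and the log-sum-exp (LSE) of the scaled scores.
    """
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    m = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    out = [sum(w * v[j] for w, v in zip(weights, values)) / total
           for j in range(dim)]
    return out, m + math.log(total)


def combine(out_a, lse_a, out_b, lse_b):
    """Merge two partial attention outputs, weighting each by exp(LSE)."""
    m = max(lse_a, lse_b)
    wa, wb = math.exp(lse_a - m), math.exp(lse_b - m)
    return [(wa * a + wb * b) / (wa + wb) for a, b in zip(out_a, out_b)]


def decomposed_attention(queries, prefix_k, prefix_v, suffix_ks, suffix_vs):
    """Attention computed separately over the shared prefix and each suffix."""
    results = []
    for q, sk, sv in zip(queries, suffix_ks, suffix_vs):
        # Prefix keys/values are the same for every sequence, so they are
        # stored (and, in a batched kernel, read) only once.
        p_out, p_lse = attn_with_lse(q, prefix_k, prefix_v)
        s_out, s_lse = attn_with_lse(q, sk, sv)
        results.append(combine(p_out, p_lse, s_out, s_lse))
    return results


def reference_attention(queries, prefix_k, prefix_v, suffix_ks, suffix_vs):
    """Baseline: each sequence attends over its own full prefix+suffix cache."""
    return [attn_with_lse(q, prefix_k + sk, prefix_v + sv)[0]
            for q, sk, sv in zip(queries, suffix_ks, suffix_vs)]


# Toy data (assumed sizes): 3 sequences, a 6-token shared prefix, head dim 4.
rng = random.Random(0)


def rand_matrix(rows, dim):
    return [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(rows)]


dim = 4
prefix_k, prefix_v = rand_matrix(6, dim), rand_matrix(6, dim)
suffix_ks = [rand_matrix(n, dim) for n in (2, 3, 1)]
suffix_vs = [rand_matrix(n, dim) for n in (2, 3, 1)]
queries = rand_matrix(3, dim)

split = decomposed_attention(queries, prefix_k, prefix_v, suffix_ks, suffix_vs)
full = reference_attention(queries, prefix_k, prefix_v, suffix_ks, suffix_vs)
max_err = max(abs(a - b) for rs, rf in zip(split, full) for a, b in zip(rs, rf))
print("max abs difference:", max_err)
```

The two code paths agree up to floating-point rounding, which is what "exact" means in this context: the decomposition changes how the work is scheduled, not the result.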

Code & Models

Repositories

jordan-benjamin/hydragen · PyTorch · Official

Taxonomy

Topics: Natural Language Processing Techniques · Mathematics, Computing, and Information Processing