POD-Attention: Unlocking Full Prefill-Decode Overlap for Faster LLM Inference

Aditya K Kamath; Ramya Prabhu; Jayashree Mohan; Simon Peter; Ramachandran Ramjee; Ashish Panwar

arXiv:2410.18038·cs.LG·February 18, 2025

TL;DR

POD-Attention is a GPU kernel that computes prefill and decode attention concurrently for hybrid batches in LLM inference, raising GPU utilization and speeding up attention computation by up to 59% (mean 28%).

Contribution

This paper presents the first GPU kernel capable of concurrently computing attention for hybrid batches, optimizing resource utilization during LLM inference.

Findings

1. Speeds up attention computation by up to 59%.
2. Achieves a mean speedup of 28% over traditional kernels.
3. Enables higher throughput and lower latency in LLM inference.

Abstract

Each request in LLM inference goes through two phases: compute-bound prefill and memory-bandwidth-bound decode. To improve GPU utilization, recent systems use hybrid batching that combines the prefill and decode phases of different requests into the same batch. This approach optimizes linear operations but remains inefficient for attention computation because existing attention kernels specialize execution independently for the prefill and decode phases. In this paper, we present POD-Attention, the first GPU kernel that efficiently computes attention for hybrid batches. POD-Attention aims to maximize the utilization of both compute and memory bandwidth by carefully allocating the GPU's resources such that prefill and decode operations happen concurrently on the same multiprocessor. POD-Attention speeds up attention computation by up to 59% (mean 28%), enabling higher throughput…
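The compute-bound versus memory-bound distinction in the abstract can be illustrated with a rough roofline-style estimate. The sketch below is not the paper's kernel or cost model: it assumes a single attention head, 2-byte (fp16/bf16) elements, no causal masking, and ignores softmax cost. All function names and the peak hardware numbers used in the example are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass

BYTES_PER_ELEM = 2  # assumption: fp16/bf16 tensors


@dataclass
class AttentionCost:
    flops: float        # floating-point operations
    bytes_moved: float  # bytes read from / written to GPU memory

    @property
    def intensity(self) -> float:
        """Arithmetic intensity in FLOPs per byte."""
        return self.flops / self.bytes_moved


def prefill_cost(seq_len: int, head_dim: int) -> AttentionCost:
    # QK^T and PV each cost about 2 * n * n * d FLOPs (masking ignored).
    flops = 4.0 * seq_len * seq_len * head_dim
    # Q, K, V are read once and O is written once: four n x d tensors.
    bytes_moved = 4.0 * seq_len * head_dim * BYTES_PER_ELEM
    return AttentionCost(flops, bytes_moved)


def decode_cost(context_len: int, head_dim: int) -> AttentionCost:
    # One query token attends over the whole cached context.
    flops = 4.0 * context_len * head_dim
    # The K and V caches dominate traffic; q and o add one row each.
    bytes_moved = (2.0 * context_len + 2.0) * head_dim * BYTES_PER_ELEM
    return AttentionCost(flops, bytes_moved)


def run_time(cost: AttentionCost, peak_flops: float, peak_bw: float) -> float:
    # Roofline estimate: the slower of compute and memory sets the time.
    return max(cost.flops / peak_flops, cost.bytes_moved / peak_bw)


def serial_time(p: AttentionCost, d: AttentionCost,
                peak_flops: float, peak_bw: float) -> float:
    # Prefill and decode attention run as two separate kernels.
    return run_time(p, peak_flops, peak_bw) + run_time(d, peak_flops, peak_bw)


def overlapped_time(p: AttentionCost, d: AttentionCost,
                    peak_flops: float, peak_bw: float) -> float:
    # Idealized bound: both phases share compute and bandwidth perfectly.
    return max((p.flops + d.flops) / peak_flops,
               (p.bytes_moved + d.bytes_moved) / peak_bw)


if __name__ == "__main__":
    PEAK_FLOPS = 300e12  # hypothetical peak, FLOPs/s
    PEAK_BW = 2e12       # hypothetical peak, bytes/s
    p = prefill_cost(seq_len=4096, head_dim=128)
    d = decode_cost(context_len=4096, head_dim=128)
    print("prefill FLOPs/byte:", p.intensity)
    print("decode FLOPs/byte:", d.intensity)
    print("serial time (s):", serial_time(p, d, PEAK_FLOPS, PEAK_BW))
    print("overlapped bound (s):", overlapped_time(p, d, PEAK_FLOPS, PEAK_BW))
```

Under these assumptions prefill intensity grows as n/2 FLOPs per byte while decode stays below 1 FLOP per byte, which is why the two phases stress different resources. The overlapped figure is only an idealized lower bound on what co-running the two phases could achieve; it says nothing about how POD-Attention actually partitions a multiprocessor.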

Code & Models

Repositories

microsoft/vattention/tree/main/pod_attn (Official)

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling

Methods: Softmax · Attention Is All You Need