ChunkAttention: Efficient Self-Attention with Prefix-Aware KV Cache and Two-Phase Partition

Lu Ye; Ze Tao; Yong Huang; Yang Li

arXiv:2402.15220 · cs.LG · August 2, 2024 · 1 citation

TL;DR

ChunkAttention introduces a prefix-aware self-attention mechanism that efficiently shares key/value tensors across requests with shared prompts, significantly reducing inference latency in large language models.

Contribution

It proposes a novel prefix-aware KV cache and a two-phase partition algorithm to enhance memory utilization and speed up self-attention computation for long sequences.
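
Splitting the KV cache into chunks relies on attention being computable one chunk at a time. The sketch below illustrates why that works: softmax attention over a split key/value sequence can be evaluated per chunk and then merged exactly by tracking a running maximum and normalizer. This is the standard online-softmax identity, written in plain Python for a single query vector; it is not the paper's kernel, and the names `partial_attention`, `merge`, and `finalize` are hypothetical.

```python
import math


def partial_attention(q, keys, values):
    """Unnormalized attention of one query over one chunk of keys/values.

    Returns (m, s, o): the maximum score, the sum of exp(score - m),
    and the exp-weighted sum of the value vectors.
    """
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(x - m) for x in scores]
    s = sum(weights)
    dim = len(values[0])
    o = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return m, s, o


def merge(p1, p2):
    """Combine two partial results as if their chunks had been concatenated."""
    m1, s1, o1 = p1
    m2, s2, o2 = p2
    m = max(m1, m2)
    a1, a2 = math.exp(m1 - m), math.exp(m2 - m)  # rescale to the shared maximum
    return m, s1 * a1 + s2 * a2, [x * a1 + y * a2 for x, y in zip(o1, o2)]


def finalize(p):
    """Apply the softmax normalizer to obtain the attention output."""
    _, s, o = p
    return [x / s for x in o]
```

Because the merge is exact, the partial result for a chunk holding a shared prompt does not depend on how the remaining chunks of a sequence are processed, which is what makes partitioning the work by chunk possible in the first place.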

Findings

01. Speeds up the self-attention kernel by 3.2-4.8× compared to state-of-the-art implementations.

02. Shares KV tensors across requests whose prompts have a common prefix, improving the memory utilization of the KV cache.

03. The speedup is reported for system prompt lengths ranging from 1024 to 4096 tokens.

Abstract

Self-attention is an essential component of large language models (LLMs) but a significant source of inference latency for long sequences. In multi-tenant LLM serving scenarios, the compute and memory operation cost of self-attention can be reduced by exploiting the likelihood that multiple LLM requests share a system prompt as their prefix. In this paper, we introduce ChunkAttention, a prefix-aware self-attention module that can detect matching prompt prefixes across multiple requests and share their key/value tensors in memory at runtime to improve the memory utilization of the KV cache. This is achieved by breaking monolithic key/value tensors into smaller chunks and structuring them into an auxiliary prefix tree. On top of the prefix-tree-based KV cache, we design an efficient self-attention kernel, where a two-phase partition algorithm is implemented to improve the data…
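
The chunked prefix tree described in the abstract can be pictured with a small sketch. It is a minimal illustration under stated assumptions, not the authors' implementation: `CHUNK_SIZE`, `ChunkNode`, and `PrefixTreeKVCache` are hypothetical names, the key/value tensors are replaced by string placeholders, and only completely filled chunks are treated as shareable.

```python
from dataclasses import dataclass, field

CHUNK_SIZE = 4  # hypothetical chunk length in tokens


@dataclass
class ChunkNode:
    tokens: tuple                 # token ids covered by this chunk
    kv: object = None             # stand-in for the chunk's key/value tensors
    children: dict = field(default_factory=dict)
    ref_count: int = 0            # number of sequences currently using the chunk


class PrefixTreeKVCache:
    def __init__(self):
        self.root = ChunkNode(tokens=())
        self.allocated_chunks = 0

    def insert(self, token_ids):
        """Walk the tree chunk by chunk, reusing nodes whose tokens match.

        Returns the list of chunk nodes that back this sequence.
        """
        node, path = self.root, []
        for i in range(0, len(token_ids), CHUNK_SIZE):
            key = tuple(token_ids[i:i + CHUNK_SIZE])
            full = len(key) == CHUNK_SIZE
            # Assumption: a partially filled trailing chunk stays private,
            # since the owning sequence may still append tokens to it.
            child = node.children.get(key) if full else None
            if child is None:
                child = ChunkNode(tokens=key, kv=f"kv-block-{self.allocated_chunks}")
                self.allocated_chunks += 1
                if full:
                    node.children[key] = child
            child.ref_count += 1
            path.append(child)
            node = child
        return path
```

With this layout, two requests that start with the same 8-token system prompt resolve to the same first two chunk nodes and diverge only at the chunk holding their own suffix, so the shared prefix is stored once.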

Code & Models

Repositories

microsoft/chunk-attention
PyTorch · Official

Videos

ChunkAttention: Efficient Self-Attention with Prefix-Aware KV Cache and Two-Phase Partition

Taxonomy

Topics: Algorithms and Data Compression · Network Packet Processing and Optimization · Error Correcting Code Techniques

Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings