ChunkAttention: Efficient Self-Attention with Prefix-Aware KV Cache and Two-Phase Partition
Lu Ye, Ze Tao, Yong Huang, Yang Li

TL;DR
ChunkAttention introduces a prefix-aware self-attention mechanism that efficiently shares key/value tensors across requests with shared prompts, significantly reducing inference latency in large language models.
Contribution
It proposes a novel prefix-aware KV cache and a two-phase partition algorithm to enhance memory utilization and speed up self-attention computation for long sequences.
Findings
Achieves a 3.2-4.8× speedup of the self-attention kernel over state-of-the-art implementations.
Effectively shares KV tensors across requests with shared prompts, improving memory efficiency.
Evaluated with shared system prompts of 1024 to 4096 tokens, the range over which the speedup is reported.
Abstract
Self-attention is an essential component of large language models (LLMs) but a significant source of inference latency for long sequences. In multi-tenant LLM serving scenarios, the compute and memory-operation cost of self-attention can be reduced by exploiting the likelihood that multiple LLM requests share a system prompt as their prefix. In this paper, we introduce ChunkAttention, a prefix-aware self-attention module that can detect matching prompt prefixes across multiple requests and share their key/value tensors in memory at runtime to improve the memory utilization of the KV cache. This is achieved by breaking monolithic key/value tensors into smaller chunks and structuring them into an auxiliary prefix tree. On top of the prefix-tree-based KV cache, we design an efficient self-attention kernel, where a two-phase partition algorithm is implemented to improve the data…
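The chunked, prefix-tree KV cache described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the names `ChunkNode`, `PrefixTreeKVCache`, and `CHUNK_SIZE` are hypothetical, strings stand in for real key/value tensors, and the rule that only full chunks are shared is an assumption made to keep the example simple.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

CHUNK_SIZE = 4  # hypothetical chunk length in tokens (assumption, not from the paper)


@dataclass
class ChunkNode:
    """One prefix-tree node holding KV entries for up to CHUNK_SIZE tokens."""
    tokens: Tuple[int, ...]
    kv: List[str]  # placeholder for per-token key/value tensors
    children: Dict[Tuple[int, ...], "ChunkNode"] = field(default_factory=dict)
    ref_count: int = 0  # number of sequences whose path includes this node


class PrefixTreeKVCache:
    """Toy cache: sequences with the same leading chunks reuse the same nodes."""

    def __init__(self) -> None:
        self.root = ChunkNode(tokens=(), kv=[])
        self.allocated_chunks = 0

    def _new_node(self, chunk: Tuple[int, ...]) -> ChunkNode:
        self.allocated_chunks += 1
        return ChunkNode(tokens=chunk, kv=[f"kv({t})" for t in chunk])

    def insert(self, token_ids: List[int]) -> List[ChunkNode]:
        """Register a sequence and return its path of chunks from the root."""
        node, path = self.root, []
        for start in range(0, len(token_ids), CHUNK_SIZE):
            chunk = tuple(token_ids[start:start + CHUNK_SIZE])
            if len(chunk) == CHUNK_SIZE:
                # Full chunk: reuse an existing child with identical tokens.
                child = node.children.get(chunk)
                if child is None:
                    child = self._new_node(chunk)
                    node.children[chunk] = child
            else:
                # Partial trailing chunk: kept private to this sequence.
                child = self._new_node(chunk)
            child.ref_count += 1
            path.append(child)
            node = child
        return path


# Two requests share an 8-token system prompt (two full chunks).
system_prompt = list(range(8))
cache = PrefixTreeKVCache()
path_a = cache.insert(system_prompt + [100, 101, 102])
path_b = cache.insert(system_prompt + [200, 201])

shared = sum(1 for a, b in zip(path_a, path_b) if a is b)
print(f"chunks allocated: {cache.allocated_chunks}, shared by both: {shared}")
```

In this toy run the two requests need six chunks in total but only four are allocated, because the two system-prompt chunks are stored once and referenced by both paths. Keying each child on its token tuple under a specific parent means a chunk is reused only when the entire preceding prefix also matches, which is the condition under which the cached key/value entries are identical.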
