TurboAttention: Efficient Attention Approximation For High Throughputs LLMs
Hao Kang, Srikant Bharadwaj, James Hensman, Tushar Krishna, Victor Ruhle, Saravan Rajmohan

TL;DR
TurboAttention introduces a novel quantization and sparsity-based approach to significantly improve the efficiency of attention mechanisms in large language models, reducing memory and computation requirements while increasing throughput.
Contribution
Proposes TurboAttention, which combines FlashQ (a headwise attention quantization technique) and SAS (Sparsity-based Softmax Approximation) to enable quantized execution of attention and compression of the KV cache, outperforming existing methods.
Findings
Achieves 1.2-1.8x speedup in attention computation.
Reduces KV cache size by over 4.4x.
Enables up to 2.37x throughput increase over FP16 baseline.
Abstract
Large language model (LLM) inference demands a significant amount of computation and memory, especially in the key attention mechanism. While techniques such as quantization and acceleration algorithms like FlashAttention have improved the efficiency of overall inference, they address different aspects of the problem: quantization focuses on weight-activation operations, while FlashAttention improves execution but requires high-precision formats. Recent key-value (KV) cache quantization reduces memory bandwidth but still needs floating-point dequantization for the attention operation. We present TurboAttention, a comprehensive approach to enable quantized execution of attention that simultaneously addresses both memory and computational efficiency. Our solution introduces two key innovations: FlashQ, a headwise attention quantization technique that enables both compression of KV cache…
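To make the idea of headwise quantization concrete, here is a minimal sketch in plain Python. It is not FlashQ itself: the abstract does not specify bit widths, scale granularity, or rounding, so this example assumes symmetric 8-bit quantization with a single scale per attention head. The function names `quantize_per_head` and `dequantize_per_head` and the toy `keys` data are hypothetical and chosen only for illustration.

```python
import random

def quantize_per_head(heads, num_bits=8):
    """Symmetric quantization with one scale per head (illustrative assumption).

    `heads` is a list of heads; each head is a flat list of floats.
    Returns (quantized integer lists, per-head scales).
    """
    qmax = 2 ** (num_bits - 1) - 1  # 127 for 8 bits
    quantized, scales = [], []
    for values in heads:
        # The largest magnitude in this head maps to qmax.
        max_abs = max(abs(v) for v in values)
        scale = max_abs / qmax if max_abs > 0 else 1.0
        # Round to the nearest integer level and clamp to the signed range.
        q = [max(-qmax, min(qmax, round(v / scale))) for v in values]
        quantized.append(q)
        scales.append(scale)
    return quantized, scales

def dequantize_per_head(quantized, scales):
    """Map integer levels back to floats using each head's scale."""
    return [[q * s for q in head] for head, s in zip(quantized, scales)]

random.seed(0)
# Hypothetical toy cache: 4 heads with different dynamic ranges.
keys = [[random.gauss(0.0, 1.0 + h) for _ in range(32)] for h in range(4)]
q_keys, scales = quantize_per_head(keys)
recon = dequantize_per_head(q_keys, scales)
```

Because each head gets its own scale, a head with a wide dynamic range does not force coarse quantization steps onto a head with a narrow one. In this sketch the reconstruction error of every element is bounded by half of its head's scale.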
Taxonomy
Topics: Advancements in Photolithography Techniques · Iterative Learning Control Systems · Advanced Surface Polishing Techniques
Methods: Attention Is All You Need · Softmax
