TurboAttention: Efficient Attention Approximation For High Throughputs LLMs

Hao Kang; Srikant Bharadwaj; James Hensman; Tushar Krishna; Victor Ruhle; Saravan Rajmohan

arXiv:2412.08585 · cs.LG · December 18, 2024

TL;DR

TurboAttention introduces a novel quantization and sparsity-based approach to significantly improve the efficiency of attention mechanisms in large language models, reducing memory and computation requirements while increasing throughput.

Contribution

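The paper proposes TurboAttention, which combines FlashQ (headwise attention quantization) and SAS (a sparsity-based softmax approximation) to enable quantized execution of attention together with KV cache compression, and reports better results than existing methods.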

Findings

01 Achieves a 1.2-1.8x speedup in attention computation.

02 Reduces KV cache size by over 4.4x.

03 Enables up to a 2.37x throughput increase over the FP16 baseline.

Abstract

Large language model (LLM) inference demands a significant amount of computation and memory, especially in the key attention mechanism. Techniques such as quantization and acceleration algorithms such as FlashAttention have improved the efficiency of overall inference, but they address different aspects of the problem: quantization focuses on weight-activation operations, while FlashAttention improves execution but requires high-precision formats. Recent key-value (KV) cache quantization reduces memory bandwidth but still needs floating-point dequantization for the attention operation. We present TurboAttention, a comprehensive approach to enable quantized execution of attention that simultaneously addresses both memory and computational efficiency. Our solution introduces two key innovations: FlashQ, a headwise attention quantization technique that enables both compression of KV cache…
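To make the idea of headwise KV cache quantization concrete, here is a minimal sketch in plain Python. It is not the paper's FlashQ algorithm: it assumes simple asymmetric uniform quantization with one scale and offset per attention head, and every name in it (`quantize_head`, `dequantize_head`, `quantize_kv`, the 4-bit default) is hypothetical. Note that it dequantizes back to floating point before use, which is exactly the step the abstract says existing KV cache quantization still needs.

```python
from typing import List, Tuple

# One head of a KV cache: [tokens][head_dim] (illustrative layout, an assumption).
Head = List[List[float]]
QuantizedHead = Tuple[List[List[int]], float, float]


def quantize_head(head: Head, bits: int = 4) -> QuantizedHead:
    """Asymmetric uniform quantization with a single (scale, offset) per head."""
    flat = [x for row in head for x in row]
    lo, hi = min(flat), max(flat)
    levels = (1 << bits) - 1
    # Guard against a constant head, where hi == lo.
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [
        [min(levels, max(0, round((x - lo) / scale))) for x in row]
        for row in head
    ]
    return codes, scale, lo


def dequantize_head(codes: List[List[int]], scale: float, lo: float) -> Head:
    """Map integer codes back to floats (the step quantized execution avoids)."""
    return [[c * scale + lo for c in row] for row in codes]


def quantize_kv(cache: List[Head], bits: int = 4) -> List[QuantizedHead]:
    """Quantize each head independently so heads with different ranges
    do not share one coarse scale."""
    return [quantize_head(head, bits) for head in cache]


# Two heads with very different value ranges.
cache = [
    [[0.0, 1.0], [0.5, -1.0]],
    [[10.0, -20.0], [5.0, 0.0]],
]
packed = quantize_kv(cache, bits=4)
```

Because each head carries its own scale, the narrow-range head keeps a fine quantization step instead of inheriting the wide-range head's coarse one; the reconstruction error per element stays within half a step of that head's scale.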

Taxonomy

Topics: Advancements in Photolithography Techniques · Iterative Learning Control Systems · Advanced Surface Polishing Techniques

Methods: Attention Is All You Need · Softmax