Lean Attention: Hardware-Aware Scalable Attention Mechanism for the Decode-Phase of Transformers
Rya Sanovar, Srikant Bharadwaj, Renee St. Amant, Victor Rühle, Saravan Rajmohan

TL;DR
This paper introduces LeanAttention, a hardware-aware scalable attention mechanism optimized for the decode-phase of transformers, significantly improving speed for long context lengths by parallelizing attention computation.
Contribution
The paper presents LeanAttention, a novel attention computation method that re-designs execution flow for the decode-phase, enabling efficient parallel processing of long contexts in transformer models.
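To make "parallel processing of long contexts" concrete: in the decode phase a single query attends over the whole key-value cache, so the cache can be split into chunks that are processed independently and then merged. The sketch below illustrates the general softmax-rescaling trick that makes such a split exact. It is a plain-Python illustration under assumptions, not the paper's GPU kernel, and the names `attend_chunk`, `merge`, and `chunked_decode_attention` are hypothetical.

```python
import math
import random

def attend_chunk(q, keys, values):
    """Unnormalized attention of one query over one chunk of the KV cache.

    Returns (m, l, o): the chunk's max score, the sum of exp(score - m),
    and the exp-weighted sum of value vectors (not yet divided by l).
    """
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    l = sum(weights)
    o = [sum(w * v[j] for w, v in zip(weights, values))
         for j in range(len(values[0]))]
    return m, l, o

def merge(a, b):
    """Combine two partial results by rescaling both to a shared max."""
    (m_a, l_a, o_a), (m_b, l_b, o_b) = a, b
    m = max(m_a, m_b)
    s_a, s_b = math.exp(m_a - m), math.exp(m_b - m)
    l = s_a * l_a + s_b * l_b
    o = [s_a * x + s_b * y for x, y in zip(o_a, o_b)]
    return m, l, o

def chunked_decode_attention(q, keys, values, chunk_size):
    """Attention for a single decode-step query, computed chunk by chunk.

    Each chunk could run on a separate worker; here they run sequentially.
    """
    partials = [
        attend_chunk(q, keys[i:i + chunk_size], values[i:i + chunk_size])
        for i in range(0, len(keys), chunk_size)
    ]
    acc = partials[0]
    for p in partials[1:]:
        acc = merge(acc, p)
    _, l, o = acc
    return [x / l for x in o]  # normalize once, at the very end

# Small demo with made-up sizes: one chunk versus many chunks.
random.seed(0)
d, n = 8, 50
q = [random.gauss(0, 1) for _ in range(d)]
keys = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]
values = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]
full = chunked_decode_attention(q, keys, values, chunk_size=n)
split = chunked_decode_attention(q, keys, values, chunk_size=7)
```

Because `merge` gives the same result regardless of how the partial results are grouped, chunks of unequal size can be reduced in any order, which is what allows the work to be spread across many compute units.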
Findings
Achieves an average 2.6x speedup in attention execution over FlashAttention-2.
Reaches up to 8.33x speedup over FlashAttention-2 at 512k context length.
Effectively scales attention computation for long sequences.
Abstract
Transformer-based models have emerged as one of the most widely used architectures for natural language processing, natural language generation, and image generation. The size of state-of-the-art models has increased steadily, reaching billions of parameters. These huge models are memory hungry and incur significant inference latency even on cutting-edge AI accelerators, such as GPUs. Specifically, the time and memory complexity of the attention operation is quadratic in the total context length, i.e., prompt and output tokens. Thus, several optimizations such as key-value tensor caching and FlashAttention computation have been proposed to meet the low-latency demands of applications relying on such large models. However, these techniques do not cater to the computationally distinct nature of different phases during inference. To that end, we propose LeanAttention, a…
Taxonomy
Topics: Manufacturing Process and Optimization · Industrial Vision Systems and Defect Detection
Methods: Softmax
