Hardware-Efficient Attention for Fast Decoding
Ted Zadouri, Hubert Strauss, Tri Dao

TL;DR
This paper introduces two hardware-efficient attention mechanisms, Grouped-Tied Attention (GTA) and Grouped Latent Attention (GLA), that speed up decoding and reduce memory usage in large language models without sacrificing quality.
Contribution
The paper proposes two attention variants, GTA and GLA, designed for hardware efficiency and parallelism, enabling faster decoding with a smaller KV cache.
Findings
GTA matches GQA quality with half the KV cache.
GLA matches MLA quality and is easier to shard.
The GLA kernel is up to 2× faster than FlashMLA.
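To make the "half the KV cache" finding concrete, the sketch below counts cache bytes per token. It is a simplified illustration, not the paper's implementation: it assumes a GQA-style cache stores separate key and value states per KV head, while a tied cache stores one shared state per KV head, and it ignores any extra positional component a real design may keep. The function name and the model shape are hypothetical.

```python
def kv_cache_bytes_per_token(n_layers, n_kv_heads, head_dim,
                             bytes_per_elem=2, tied=False):
    """Bytes of KV cache stored per token, summed over all layers.

    tied=False: separate K and V states per KV head (GQA-style).
    tied=True:  one shared state per KV head used as both K and V
                (simplifying assumption for a GTA-style cache).
    """
    states_per_head = 1 if tied else 2
    return n_layers * n_kv_heads * head_dim * states_per_head * bytes_per_elem


# Hypothetical model shape: 32 layers, 8 KV heads, head_dim 128, 2-byte elements.
gqa_bytes = kv_cache_bytes_per_token(32, 8, 128)
tied_bytes = kv_cache_bytes_per_token(32, 8, 128, tied=True)
print(gqa_bytes, tied_bytes)
```

Under these assumptions the tied cache is exactly half the size of the untied one, which is where the memory saving in the finding comes from.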
Abstract
LLM decoding is bottlenecked for large batches and long contexts by loading the key-value (KV) cache from high-bandwidth memory, which inflates per-token latency, while the sequential nature of decoding limits parallelism. We analyze the interplay among arithmetic intensity, parallelization, and model quality and question whether current architectures fully exploit modern hardware. This work redesigns attention to perform more computation per byte loaded from memory to maximize hardware efficiency without trading off parallel scalability. We first propose Grouped-Tied Attention (GTA), a simple variant that combines and reuses key and value states, reducing memory transfers without compromising model quality. We then introduce Grouped Latent Attention (GLA), a parallel-friendly latent attention paired with low-level optimizations for fast decoding while maintaining high model quality.…
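The "computation per byte loaded" idea in the abstract is arithmetic intensity. The sketch below estimates it for one decoding step of one KV head under simplifying assumptions that are not from the paper: only K and V loads are counted (query loads, output writes, and softmax cost are ignored), and each query head performs a QK^T and a PV product against the cached states. The function name and the `tied` flag are hypothetical.

```python
def decode_arithmetic_intensity(group_size, seq_len, head_dim,
                                bytes_per_elem=2, tied=False):
    """Approximate FLOPs per byte of KV cache loaded in one decode step.

    group_size: number of query heads sharing one KV head.
    tied:       if True, a single cached state serves as both K and V,
                so only one state is loaded (simplifying assumption).
    """
    # Each query head: QK^T costs 2*seq_len*head_dim FLOPs, PV costs the same.
    flops = 4 * group_size * seq_len * head_dim
    states_loaded = 1 if tied else 2
    bytes_loaded = states_loaded * seq_len * head_dim * bytes_per_elem
    return flops / bytes_loaded


mha = decode_arithmetic_intensity(group_size=1, seq_len=4096, head_dim=128)
gqa = decode_arithmetic_intensity(group_size=8, seq_len=4096, head_dim=128)
tied = decode_arithmetic_intensity(group_size=8, seq_len=4096, head_dim=128, tied=True)
print(mha, gqa, tied)
```

In this model the intensity is independent of sequence length and head dimension: it grows with the number of query heads sharing each cached state and doubles when one loaded state is reused as both key and value.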
Taxonomy
Topics: Blind Source Separation Techniques · Digital Filter Design and Implementation · Neural Networks and Applications
Methods: Dense Connections · Softmax · Feedforward Network · Attention Is All You Need · Grouped-query attention
