Hardware-Efficient Attention for Fast Decoding

Ted Zadouri; Hubert Strauss; Tri Dao

arXiv:2505.21487·cs.LG·May 28, 2025

TL;DR

This paper introduces two hardware-efficient attention mechanisms, Grouped-Tied Attention (GTA) and Grouped Latent Attention (GLA), that speed up decoding and reduce KV-cache memory traffic in large language models without sacrificing quality.

Contribution

It proposes two novel attention variants, GTA and GLA, optimized for hardware efficiency and parallelism, enabling faster decoding with less memory.

Findings

01 GTA matches GQA quality with half the KV cache.

02 GLA matches MLA quality and is easier to shard.

03 GLA kernel is up to 2× faster than FlashMLA.
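
The first finding can be made concrete with a back-of-envelope estimate of per-token KV-cache size. This is a minimal sketch, not the authors' implementation: the function names and example dimensions are hypothetical, and it assumes a tied layout caches exactly one shared state per KV head, ignoring any separate positional-encoding component a real GTA layer may also store.

```python
def gqa_kv_bytes_per_token(n_kv_heads, head_dim, n_layers, bytes_per_elem=2):
    """Per-token KV-cache size when keys and values are cached separately."""
    # One key vector and one value vector per KV head, per layer.
    return 2 * n_kv_heads * head_dim * n_layers * bytes_per_elem


def tied_kv_bytes_per_token(n_kv_heads, head_dim, n_layers, bytes_per_elem=2):
    """Per-token cache size under the assumption of one shared K/V state."""
    # Assumption: a single tied state per KV head replaces the K and V pair.
    return n_kv_heads * head_dim * n_layers * bytes_per_elem


if __name__ == "__main__":
    # Hypothetical model shape: 8 KV heads, head_dim 128, 32 layers, bf16.
    gqa = gqa_kv_bytes_per_token(8, 128, 32)
    tied = tied_kv_bytes_per_token(8, 128, 32)
    print(f"GQA:  {gqa} bytes/token")
    print(f"Tied: {tied} bytes/token ({tied / gqa:.0%} of GQA)")
```

Under these assumptions the tied layout stores half as many bytes per token, which is the ratio the finding reports.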

Abstract

LLM decoding is bottlenecked for large batches and long contexts by loading the key-value (KV) cache from high-bandwidth memory, which inflates per-token latency, while the sequential nature of decoding limits parallelism. We analyze the interplay among arithmetic intensity, parallelization, and model quality and question whether current architectures fully exploit modern hardware. This work redesigns attention to perform more computation per byte loaded from memory to maximize hardware efficiency without trading off parallel scalability. We first propose Grouped-Tied Attention (GTA), a simple variant that combines and reuses key and value states, reducing memory transfers without compromising model quality. We then introduce Grouped Latent Attention (GLA), a parallel-friendly latent attention paired with low-level optimizations for fast decoding while maintaining high model quality.…
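
The phrase "more computation per byte loaded from memory" refers to arithmetic intensity. The sketch below estimates it for one decoding step under simplifying assumptions that are not taken from the paper: only KV-cache reads are counted as memory traffic, softmax cost is ignored, and each query head performs one dot product against the keys and one weighted sum over the values. The function name and parameters are hypothetical.

```python
def decode_attention_intensity(n_q_heads, n_kv_heads, head_dim, seq_len,
                               bytes_per_elem=2):
    """Rough FLOPs per byte of KV cache loaded for one decoded token."""
    # Each query head does q.K^T and p.V over the cached sequence:
    # 2 * seq_len * head_dim FLOPs each (one multiply and one add per element).
    flops = 4 * n_q_heads * seq_len * head_dim
    # Keys and values are loaded once per KV head and shared by its query group.
    bytes_loaded = 2 * n_kv_heads * seq_len * head_dim * bytes_per_elem
    return flops / bytes_loaded


if __name__ == "__main__":
    # Hypothetical shapes with bf16 (2-byte) cache entries.
    print(decode_attention_intensity(32, 32, 128, 4096))  # one KV head per query head
    print(decode_attention_intensity(32, 8, 128, 4096))   # four query heads per KV head
```

In this simplified model the intensity depends only on the ratio of query heads to KV heads and on the element size, not on sequence length, which is why sharing or compressing cached states is the lever for doing more work per byte loaded.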

Taxonomy

Topics: Blind Source Separation Techniques · Digital Filter Design and Implementation · Neural Networks and Applications

Methods: Dense Connections · Softmax · Feedforward Network · Attention Is All You Need · Grouped-query attention