Flex Attention: A Programming Model for Generating Optimized Attention Kernels
Juechu Dong, Boyuan Feng, Driss Guessous, Yanbo Liang, Horace He

TL;DR
FlexAttention is a new compiler-driven programming model that simplifies implementing and experimenting with various attention mechanisms in deep learning, achieving competitive performance and enabling easy composition of attention variants.
Contribution
The paper introduces FlexAttention, a flexible, compiler-based framework that allows rapid implementation and combination of diverse attention variants in PyTorch.
Findings
FlexAttention can implement many attention variants with minimal code.
It achieves performance comparable to specialized handwritten kernels.
FlexAttention facilitates easy composition of different attention mechanisms.
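The composition claim can be illustrated with a minimal pure-Python sketch. It is not the FlexAttention API: it assumes each attention variant is described by a predicate over (batch, head, query index, key index), and every helper name below (`causal`, `make_sliding_window`, `all_of`, `dense_mask`) is hypothetical and chosen only for illustration.

```python
# Sketch under stated assumptions: a mask predicate takes (b, h, q_idx, kv_idx)
# and returns True where the query position may attend to the key position.
# All names here are illustrative, not part of any library.

def causal(b, h, q_idx, kv_idx):
    # A query may only look at keys at the same or an earlier position.
    return q_idx >= kv_idx

def make_sliding_window(window):
    # Build a predicate that keeps keys at most `window` positions behind.
    def sliding_window(b, h, q_idx, kv_idx):
        return q_idx - kv_idx <= window
    return sliding_window

def all_of(*preds):
    # Compose predicates: attend only where every predicate allows it.
    def combined(b, h, q_idx, kv_idx):
        return all(p(b, h, q_idx, kv_idx) for p in preds)
    return combined

def dense_mask(pred, q_len, kv_len, b=0, h=0):
    # Materialize the predicate as a dense boolean matrix for inspection.
    return [[pred(b, h, q, k) for k in range(kv_len)] for q in range(q_len)]

causal_window = all_of(causal, make_sliding_window(1))
mask = dense_mask(causal_window, 4, 4)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

Each variant stays a small independent function, and the combined variant needs no new kernel-specific code in this sketch; a compiler-based system would be responsible for turning such a description into a fused kernel.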
Abstract
Over the past 7 years, attention has become one of the most important primitives in deep learning. The primary approach to optimize attention is FlashAttention, which fuses the operation together, drastically improving both the runtime and the memory consumption. However, the importance of FlashAttention combined with its monolithic nature poses a problem for researchers aiming to try new attention variants -- a "software lottery". This problem is exacerbated by the difficulty of writing efficient fused attention kernels, resisting traditional compiler-based approaches. We introduce FlexAttention, a novel compiler-driven programming model that allows implementing the majority of attention variants in a few lines of idiomatic PyTorch code. We demonstrate that many existing attention variants (e.g. Alibi, Document Masking, PagedAttention, etc.) can be implemented via FlexAttention, and…
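To make the programming model concrete, here is a slow pure-Python reference of attention with a user-supplied score modification. It is a sketch, not the paper's implementation or the PyTorch API: the callback signature `(score, b, h, q_idx, kv_idx)` is an assumption modeled on how such variants are commonly described, and `reference_attention`, `causal_score_mod`, `distance_bias`, and the slope value are hypothetical.

```python
import math

def reference_attention(q, k, v, score_mod=None, b=0, h=0):
    """Unfused attention for one (batch, head) slice.

    q: [q_len][d], k: [kv_len][d], v: [kv_len][d_v], all nested lists.
    score_mod (assumed signature) rewrites each scaled dot-product score
    before the softmax.
    """
    d = len(q[0])
    d_v = len(v[0])
    out = []
    for q_idx, q_row in enumerate(q):
        scores = []
        for kv_idx, k_row in enumerate(k):
            s = sum(a * c for a, c in zip(q_row, k_row)) / math.sqrt(d)
            if score_mod is not None:
                s = score_mod(s, b, h, q_idx, kv_idx)
            scores.append(s)
        m = max(scores)
        if m == -math.inf:
            # Every key is masked out for this query: emit zeros.
            out.append([0.0] * d_v)
            continue
        # Numerically stable softmax; exp(-inf) evaluates to 0.0.
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        out.append([
            sum(w * v_row[j] for w, v_row in zip(weights, v)) / total
            for j in range(d_v)
        ])
    return out

def causal_score_mod(score, b, h, q_idx, kv_idx):
    # Masking expressed as a score change: future keys get -inf.
    return score if q_idx >= kv_idx else -math.inf

def distance_bias(score, b, h, q_idx, kv_idx, slope=0.5):
    # A linear distance penalty in the spirit of ALiBi; the slope is made up.
    return score - slope * abs(q_idx - kv_idx)

q = [[1.0, 0.0], [0.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0]]
v = [[1.0, 0.0], [0.0, 1.0]]  # identity values, so outputs equal the weights
causal_out = reference_attention(q, k, v, score_mod=causal_score_mod)
biased_out = reference_attention(q, k, v, score_mod=distance_bias)
```

In this sketch a new variant is only a few lines of score logic. The abstract's argument is that a fused kernel normally hard-codes that logic, which is the part a compiler-driven approach aims to generate instead.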
Code & Models
- 🤗 Synthyra/ANKH_large (model) · 412 dl · ♡ 1
- 🤗 Synthyra/ANKH_base (model) · 610 dl
- 🤗 Synthyra/ANKH2_large (model) · 391 dl · ♡ 1
- 🤗 Synthyra/FastESM2_650 (model) · 300 dl · ♡ 3
- 🤗 Synthyra/ESMplusplus_large (model) · 80k dl · ♡ 17
- 🤗 Synthyra/ESMplusplus_small (model) · 27k dl · ♡ 18
- 🤗 Synthyra/ESM2-8M (model) · 1.3k dl · ♡ 2
- 🤗 Synthyra/ESM2-35M (model) · 803 dl · ♡ 2
- 🤗 Synthyra/ESM2-150M (model) · 366 dl · ♡ 1
- 🤗 Synthyra/ESM2-650M (model) · 335 dl · ♡ 1
Taxonomy
Topics: Embedded Systems Design Techniques
Methods: Softmax · Attention Is All You Need
