Flex Attention: A Programming Model for Generating Optimized Attention Kernels

Juechu Dong; Boyuan Feng; Driss Guessous; Yanbo Liang; Horace He

arXiv:2412.05496 · cs.LG · December 10, 2024 · 2 cites

TL;DR

FlexAttention is a compiler-driven programming model for implementing and experimenting with attention mechanisms in deep learning. It delivers performance comparable to specialized handwritten kernels and lets attention variants be composed with one another.

Contribution

The paper introduces FlexAttention, a flexible, compiler-based framework for rapidly implementing and combining diverse attention variants in PyTorch.

Findings

01 FlexAttention can implement many attention variants with minimal code.

02 It achieves performance comparable to specialized handwritten kernels.

03 FlexAttention facilitates easy composition of different attention mechanisms.
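
The composition finding can be illustrated with a minimal pure-Python sketch. It assumes each attention variant is written as a predicate over (batch, head, query index, key index) that says whether a query may attend to a key; that calling convention is borrowed from the public PyTorch FlexAttention API and is not spelled out on this page. The helper names `causal_mask`, `make_document_mask`, `combine_and`, and `materialize` are hypothetical, and the dense matrix is built only to make the result visible, not because an optimized kernel would do so.

```python
from typing import Callable, List

# A mask predicate: may query position q_idx attend to key position kv_idx?
MaskFn = Callable[[int, int, int, int], bool]


def causal_mask(b: int, h: int, q_idx: int, kv_idx: int) -> bool:
    # A query sees itself and earlier positions only.
    return q_idx >= kv_idx


def make_document_mask(doc_ids: List[int]) -> MaskFn:
    # doc_ids[i] is the (hypothetical) document id of token i in a packed sequence.
    def document_mask(b: int, h: int, q_idx: int, kv_idx: int) -> bool:
        return doc_ids[q_idx] == doc_ids[kv_idx]

    return document_mask


def combine_and(*mask_fns: MaskFn) -> MaskFn:
    # Composition: the combined variant allows a pair only if every variant does.
    def combined(b: int, h: int, q_idx: int, kv_idx: int) -> bool:
        return all(fn(b, h, q_idx, kv_idx) for fn in mask_fns)

    return combined


def materialize(mask_fn: MaskFn, q_len: int, kv_len: int) -> List[List[bool]]:
    # Dense view for batch 0, head 0, for inspection only.
    return [[mask_fn(0, 0, q, k) for k in range(kv_len)] for q in range(q_len)]


# Two packed documents of two tokens each.
doc_ids = [0, 0, 1, 1]
causal_document = combine_and(causal_mask, make_document_mask(doc_ids))
dense = materialize(causal_document, 4, 4)

for row in dense:
    print("".join("x" if allowed else "." for allowed in row))
```

Because each variant is an ordinary function, combining two of them needs no new kernel code in this sketch, only a new predicate.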

Abstract

Over the past 7 years, attention has become one of the most important primitives in deep learning. The primary approach to optimize attention is FlashAttention, which fuses the operation together, drastically improving both the runtime and the memory consumption. However, the importance of FlashAttention combined with its monolithic nature poses a problem for researchers aiming to try new attention variants -- a "software lottery". This problem is exacerbated by the difficulty of writing efficient fused attention kernels, resisting traditional compiler-based approaches. We introduce FlexAttention, a novel compiler-driven programming model that allows implementing the majority of attention variants in a few lines of idiomatic PyTorch code. We demonstrate that many existing attention variants (e.g. Alibi, Document Masking, PagedAttention, etc.) can be implemented via FlexAttention, and…
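
To make "a few lines" per variant concrete, here is an unfused, single-head reference sketch in plain Python. It assumes a variant is expressed as a small function that rewrites each attention score before the softmax, a convention taken from the public PyTorch FlexAttention API rather than from the truncated abstract above. The names `attention_with_score_mod`, `causal_score_mod`, and `make_linear_bias` are hypothetical, and the sketch says nothing about how the paper's compiler generates fused kernels.

```python
import math
from typing import Callable, List

# A score modification: (raw score, query index, key index) -> modified score.
ScoreMod = Callable[[float, int, int], float]
Matrix = List[List[float]]


def attention_with_score_mod(q: Matrix, k: Matrix, v: Matrix, score_mod: ScoreMod) -> Matrix:
    """Reference softmax attention that applies score_mod to every score."""
    scale = 1.0 / math.sqrt(len(q[0]))
    out = []
    for q_idx, q_vec in enumerate(q):
        scores = []
        for kv_idx, k_vec in enumerate(k):
            raw = scale * sum(a * b for a, b in zip(q_vec, k_vec))
            scores.append(score_mod(raw, q_idx, kv_idx))
        # Numerically stable softmax; assumes at least one score per row is finite.
        row_max = max(scores)
        weights = [math.exp(s - row_max) for s in scores]
        total = sum(weights)
        out.append([
            sum(w * v_vec[d] for w, v_vec in zip(weights, v)) / total
            for d in range(len(v[0]))
        ])
    return out


def identity(score: float, q_idx: int, kv_idx: int) -> float:
    return score  # plain softmax attention


def causal_score_mod(score: float, q_idx: int, kv_idx: int) -> float:
    # Masked positions get -inf, so their softmax weight is exactly zero.
    return score if q_idx >= kv_idx else float("-inf")


def make_linear_bias(slope: float) -> ScoreMod:
    # An ALiBi-style distance penalty; the slope value is illustrative only.
    def linear_bias(score: float, q_idx: int, kv_idx: int) -> float:
        return score - slope * abs(q_idx - kv_idx)

    return linear_bias


q = [[1.0, 0.0], [0.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0]]

causal_out = attention_with_score_mod(q, k, v, causal_score_mod)
biased_out = attention_with_score_mod(q, k, v, make_linear_bias(0.5))
print(causal_out[0])  # the first query can only see the first key
```

Each variant here is one short function; the attention loop itself never changes, which is the separation of concerns the abstract attributes to the programming model.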

Taxonomy

Topics: Embedded Systems Design Techniques

Methods: Softmax · Attention Is All You Need