FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, Christopher Ré

TL;DR
FlashAttention is an IO-aware exact attention algorithm that cuts the number of reads and writes between levels of GPU memory. This speeds up Transformer training and makes longer contexts practical, and the longer contexts in turn improve model quality.
Contribution
The paper presents FlashAttention, a novel IO-aware exact attention algorithm that reduces memory access costs and extends to block-sparse attention, achieving faster training and longer context handling.
Findings
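15% end-to-end wall-clock speedup on BERT-large training (sequence length 512)
3x speedup on GPT-2 training (sequence length 1K)
Enables Transformers to handle sequences of up to 64K tokens, with better-than-chance accuracy on the long-sequence Path-X (16K) and Path-256 (64K) tasks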
Abstract
Transformers are slow and memory-hungry on long sequences, since the time and memory complexity of self-attention are quadratic in sequence length. Approximate attention methods have attempted to address this problem by trading off model quality to reduce the compute complexity, but often do not achieve wall-clock speedup. We argue that a missing principle is making attention algorithms IO-aware -- accounting for reads and writes between levels of GPU memory. We propose FlashAttention, an IO-aware exact attention algorithm that uses tiling to reduce the number of memory reads/writes between GPU high bandwidth memory (HBM) and GPU on-chip SRAM. We analyze the IO complexity of FlashAttention, showing that it requires fewer HBM accesses than standard attention, and is optimal for a range of SRAM sizes. We also extend FlashAttention to block-sparse attention, yielding an approximate…
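The tiling idea in the abstract can be sketched in a few lines of NumPy. This is only an illustration of the arithmetic, not the authors' CUDA kernel: the function names, block sizes, and loop order are assumptions made for the example, and NumPy has no notion of HBM or SRAM. The point it shows is that attention can be computed one block of keys and values at a time, using a running row maximum and a running softmax denominator, so the full N x N score matrix is never stored while the result stays exact.

```python
import numpy as np


def standard_attention(Q, K, V):
    """Reference implementation: materializes the full N x N score matrix."""
    d = Q.shape[-1]
    S = Q @ K.T / np.sqrt(d)
    P = np.exp(S - S.max(axis=-1, keepdims=True))
    P /= P.sum(axis=-1, keepdims=True)
    return P @ V


def tiled_attention(Q, K, V, block_q=64, block_k=64):
    """Exact attention computed block by block with a running softmax.

    block_q and block_k are illustrative tile sizes (hypothetical values);
    a real kernel would pick them so the tiles fit in on-chip SRAM.
    """
    N, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    O = np.zeros((N, V.shape[1]))
    for i in range(0, N, block_q):
        Qi = Q[i:i + block_q]
        rows = Qi.shape[0]
        m = np.full(rows, -np.inf)          # running row-wise maximum
        l = np.zeros(rows)                  # running softmax denominator
        acc = np.zeros((rows, V.shape[1]))  # unnormalized output accumulator
        for j in range(0, K.shape[0], block_k):
            # Only a (block_q x block_k) tile of scores exists at any time.
            S = (Qi @ K[j:j + block_k].T) * scale
            m_new = np.maximum(m, S.max(axis=-1))
            P = np.exp(S - m_new[:, None])
            # Rescale earlier partial results to the new maximum.
            correction = np.exp(m - m_new)
            l = l * correction + P.sum(axis=-1)
            acc = acc * correction[:, None] + P @ V[j:j + block_k]
            m = m_new
        O[i:i + block_q] = acc / l[:, None]
    return O
```

Comparing the two functions with `np.allclose` on random inputs is a quick way to check that the blockwise version matches the reference up to floating-point error.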
Code & Models
- 🤗 bigcode/starcoder (model) · 10k dl · ♡ 2932
- 🤗 bigcode/starcoder2-15b (model) · 5.2k dl · ♡ 665
- 🤗 tiiuae/falcon-40b (model) · 22k dl · ♡ 2433
- 🤗 mosaicml/mosaic-bert-base (model) · 90 dl · ♡ 47
- 🤗 tiiuae/falcon-7b (model) · 153k dl · ♡ 1099
- 🤗 tiiuae/falcon-7b-instruct (model) · 58k dl · ♡ 1031
- 🤗 bigcode/starcoderbase-megatron (model) · ♡ 2
- 🤗 mosaicml/mosaic-bert-base-seqlen-512 (model) · 15 dl · ♡ 4
- 🤗 tiiuae/falcon-rw-1b (model) · 12k dl · ♡ 118
- 🤗 tiiuae/falcon-rw-7b (model) · 335 dl · ♡ 17
Taxonomy
Topics: Advanced Neural Network Applications · Machine Learning and Data Classification · Domain Adaptation and Few-Shot Learning
Methods: Attention Is All You Need · Feedforward Network · Grouped-query attention · Multi-Query Attention · Rotary Position Embedding · Linear Layer · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · Layer Normalization · Softmax
