FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning
Tri Dao

TL;DR
FlashAttention-2 speeds up exact attention for long-sequence Transformers by about 2x over FlashAttention through better work partitioning on the GPU, reaching 50-73% of theoretical peak FLOPs/s on A100 and approaching the efficiency of GEMM operations.
Contribution
The paper introduces new work-partitioning strategies that raise GPU utilization and speed beyond the original FlashAttention while still computing exact attention, with no approximation.
Findings
Achieves around 2x speedup over FlashAttention
Reaches 50-73% of theoretical maximum FLOPs/s on A100 GPUs
Enables end-to-end training of GPT-style models at up to 225 TFLOPs/s per A100 GPU
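To put these figures on a common scale, the short sketch below converts between achieved throughput and fraction of peak. It assumes an A100 dense FP16/BF16 tensor-core peak of 312 TFLOPs/s, a number taken from NVIDIA's datasheet rather than from this page; the function names are hypothetical.

```python
# Assumption: A100 dense FP16/BF16 tensor-core peak (NVIDIA datasheet), not stated on this page.
A100_PEAK_TFLOPS = 312.0

def utilization(achieved_tflops, peak_tflops=A100_PEAK_TFLOPS):
    """Fraction of theoretical peak reached by a measured throughput."""
    return achieved_tflops / peak_tflops

def achieved(fraction, peak_tflops=A100_PEAK_TFLOPS):
    """Throughput in TFLOPs/s implied by a fraction of theoretical peak."""
    return fraction * peak_tflops

# 225 TFLOPs/s training throughput expressed as a share of the assumed peak.
print(f"training utilization: {utilization(225.0):.0%}")

# The 50-73% range expressed in TFLOPs/s under the same assumption.
print(f"kernel range: {achieved(0.50):.0f} to {achieved(0.73):.0f} TFLOPs/s")
```

Under that assumption, 225 TFLOPs/s corresponds to roughly 72% of peak, which sits at the upper end of the 50-73% range reported for the attention kernel.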
Abstract
Scaling Transformers to longer sequence lengths has been a major problem in the last several years, promising to improve performance in language modeling and high-resolution image understanding, as well as to unlock new applications in code, audio, and video generation. The attention layer is the main bottleneck in scaling to longer sequences, as its runtime and memory increase quadratically in the sequence length. FlashAttention exploits the asymmetric GPU memory hierarchy to bring significant memory saving (linear instead of quadratic) and runtime speedup (2-4× compared to optimized baselines), with no approximation. However, FlashAttention is still not nearly as fast as optimized matrix-multiply (GEMM) operations, reaching only 25-40% of the theoretical maximum FLOPs/s. We observe that the inefficiency is due to suboptimal work partitioning between different thread blocks and…
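The abstract's claim that attention can be computed exactly without a quadratic score matrix can be illustrated with a minimal NumPy sketch. This is not the paper's CUDA kernel and not its work-partitioning scheme: it only shows the underlying idea of processing keys and values block by block with a running softmax. The function names and the `block` parameter are hypothetical, and unlike a real kernel the sketch tiles over keys only, not over queries.

```python
import numpy as np

def reference_attention(q, k, v):
    """Standard attention: materializes the full (n_q, n_k) score matrix."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    probs = np.exp(scores)
    probs = probs / probs.sum(axis=-1, keepdims=True)
    return probs @ v

def tiled_attention(q, k, v, block=64):
    """Exact attention computed one key/value block at a time.

    Only an (n_q, block) score tile exists at any moment, so extra memory
    grows linearly with sequence length for a fixed block size.
    """
    n_q, d = q.shape
    scale = 1.0 / np.sqrt(d)
    out = np.zeros((n_q, v.shape[-1]))   # unnormalized output accumulator
    row_max = np.full((n_q, 1), -np.inf) # running max of scores per query
    row_sum = np.zeros((n_q, 1))         # running softmax denominator

    for start in range(0, k.shape[0], block):
        k_blk = k[start:start + block]
        v_blk = v[start:start + block]
        s = (q @ k_blk.T) * scale                     # (n_q, block) tile
        new_max = np.maximum(row_max, s.max(axis=-1, keepdims=True))
        correction = np.exp(row_max - new_max)        # rescale earlier blocks
        p = np.exp(s - new_max)
        row_sum = row_sum * correction + p.sum(axis=-1, keepdims=True)
        out = out * correction + p @ v_blk
        row_max = new_max

    return out / row_sum

# Usage: the tiled result matches the reference up to floating-point error.
rng = np.random.default_rng(0)
q = rng.standard_normal((128, 16))
k = rng.standard_normal((128, 16))
v = rng.standard_normal((128, 16))
print(np.allclose(tiled_attention(q, k, v, block=32), reference_attention(q, k, v)))
```

Rescaling the accumulator by `correction` each time the running maximum changes is what keeps the result exact: every block ends up weighted as if the global row maximum had been known from the start.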
Code & Models
- 🤗 PJMixers-Images/Florence-2-base-Castollux-v0.5 · model · 505 dl · ♡ 5
- 🤗 tiiuae/falcon-11B · model · 5.0k dl · ♡ 218
- 🤗 LoneStriker/falcon-11B-GGUF · model · 59 dl · ♡ 3
- 🤗 vsevolodl/falcon-11B-GGUF · model · 27 dl · ♡ 1
- 🤗 RichardErkhov/tiiuae_-_falcon-11B-gguf · model · 78 dl
- 🤗 amirMohammadi/Dorna-Llama3-8B-Instruct-Quantized4Bit · model · 16 dl · ♡ 11
- 🤗 frederic-sadrieh/BERTchen-v0.1 · model · 3 dl · ♡ 1
- 🤗 frederic-sadrieh/BERTchen-v0.1-C4 · model · 8 dl
- 🤗 frederic-sadrieh/hybrid-BERTchen-v0.1 · model · 8 dl
- 🤗 QuantFactory/falcon-11B-GGUF · model · 261 dl · ♡ 3
Taxonomy
Topics: Advanced Neural Network Applications · Parallel Computing and Optimization Techniques · Advanced Data Storage Technologies
Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings
