FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning

Tri Dao

arXiv:2307.08691·cs.LG·July 18, 2023·141 cites

Open Access 5 Repos 10 Models

TL;DR

FlashAttention-2 speeds up exact attention on long sequences by improving how work is partitioned on the GPU, reaching 50-73% of theoretical maximum FLOPs/s on A100 GPUs (close to the efficiency of optimized GEMM) and higher training throughput.

Contribution

FlashAttention-2 introduces work partitioning strategies that improve GPU efficiency and speed over the original FlashAttention while still computing exact attention, with no approximation.

Findings

01

Achieves around 2x speedup over FlashAttention

02

Reaches 50-73% of theoretical FLOPs/s on A100 GPUs

03

Enables training GPT-style models at up to 225 TFLOPs/s per A100 GPU
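
To put the utilization and throughput findings on a common scale, the short sketch below converts between achieved TFLOPs/s and a fraction of peak. It assumes an A100 dense FP16/BF16 tensor-core peak of 312 TFLOPs/s, and it counts attention forward FLOPs as two matrix multiplies (4 · N² · d per head, halved under a causal mask), a common convention. The function names are illustrative and do not come from the paper's code.

```python
# Assumption: A100 dense FP16/BF16 tensor-core peak, in TFLOPs/s.
A100_PEAK_TFLOPS = 312.0


def utilization(achieved_tflops, peak_tflops=A100_PEAK_TFLOPS):
    """Fraction of the theoretical peak that a measured throughput represents."""
    return achieved_tflops / peak_tflops


def attention_forward_flops(batch, heads, seqlen, head_dim, causal=False):
    """Approximate forward FLOPs for attention.

    Counts the two matrix multiplies (Q K^T and P V), each costing
    2 * seqlen^2 * head_dim FLOPs per head. A causal mask skips roughly
    half of the score matrix, so the count is halved.
    """
    flops = 4 * batch * heads * seqlen * seqlen * head_dim
    return flops // 2 if causal else flops


if __name__ == "__main__":
    # The 225 TFLOPs/s training figure expressed as a fraction of peak.
    print(f"training utilization: {utilization(225.0):.1%}")
    # The 50-73% range expressed in TFLOPs/s.
    low, high = 0.50 * A100_PEAK_TFLOPS, 0.73 * A100_PEAK_TFLOPS
    print(f"50-73% of peak: {low:.0f} to {high:.0f} TFLOPs/s")
```

Under that peak figure, 225 TFLOPs/s is a little over 72% utilization, which falls inside the 50-73% range reported for the attention kernel.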

Abstract

Scaling Transformers to longer sequence lengths has been a major problem in the last several years, promising to improve performance in language modeling and high-resolution image understanding, as well as to unlock new applications in code, audio, and video generation. The attention layer is the main bottleneck in scaling to longer sequences, as its runtime and memory increase quadratically in the sequence length. FlashAttention exploits the asymmetric GPU memory hierarchy to bring significant memory saving (linear instead of quadratic) and runtime speedup (2-4× compared to optimized baselines), with no approximation. However, FlashAttention is still not nearly as fast as optimized matrix-multiply (GEMM) operations, reaching only 25-40% of the theoretical maximum FLOPs/s. We observe that the inefficiency is due to suboptimal work partitioning between different thread blocks and…
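
The linear-memory, no-approximation property mentioned in the abstract can be illustrated with a small pure-Python sketch. It processes keys and values one block at a time and keeps only a running maximum, a running normalizer, and a running output per query (the "online softmax" rescaling), so the full row of attention scores is never stored. This is a CPU illustration of the tiling idea only, not the FlashAttention-2 GPU kernel or its work partitioning. The function names and `block_size` are hypothetical.

```python
import math


def reference_attention(q, k, v):
    """Standard softmax attention: builds a full row of scores per query."""
    scale = 1.0 / math.sqrt(len(q[0]))
    out = []
    for qi in q:
        scores = [scale * sum(a * b for a, b in zip(qi, kj)) for kj in k]
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        out.append([sum(w * vj[c] for w, vj in zip(weights, v)) / total
                    for c in range(len(v[0]))])
    return out


def tiled_attention(q, k, v, block_size=2):
    """Same result, but keys/values are visited block by block.

    Per query, only a running max, a running softmax denominator, and an
    unnormalized output accumulator are kept between blocks.
    """
    scale = 1.0 / math.sqrt(len(q[0]))
    dv = len(v[0])
    out = []
    for qi in q:
        running_max = -math.inf
        running_sum = 0.0
        acc = [0.0] * dv
        for start in range(0, len(k), block_size):
            k_blk = k[start:start + block_size]
            v_blk = v[start:start + block_size]
            scores = [scale * sum(a * b for a, b in zip(qi, kj)) for kj in k_blk]
            new_max = max(running_max, max(scores))
            # Rescale earlier partial results to the new running max.
            correction = math.exp(running_max - new_max)
            weights = [math.exp(s - new_max) for s in scores]
            running_sum = running_sum * correction + sum(weights)
            acc = [a * correction + sum(w * vj[c] for w, vj in zip(weights, v_blk))
                   for c, a in enumerate(acc)]
            running_max = new_max
        out.append([a / running_sum for a in acc])
    return out


if __name__ == "__main__":
    q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    k = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [0.0, -1.0], [2.0, 2.0]]
    v = [[1.0, 0.0, 2.0], [0.0, 1.0, 0.5], [3.0, 3.0, 3.0],
         [-1.0, 2.0, 0.0], [0.5, 0.5, 0.5]]
    print(tiled_attention(q, k, v))
    print(reference_attention(q, k, v))
```

Both functions agree up to floating-point rounding for any block size, which is the sense in which the blocked computation involves no approximation.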

Taxonomy

Topics: Advanced Neural Network Applications · Parallel Computing and Optimization Techniques · Advanced Data Storage Technologies

Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings