FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness

Tri Dao; Daniel Y. Fu; Stefano Ermon; Atri Rudra; Christopher Ré

arXiv:2205.14135 · cs.LG · June 24, 2022 · 457 cites

TL;DR

FlashAttention is an IO-aware exact attention algorithm that cuts the number of memory reads and writes between GPU high bandwidth memory (HBM) and on-chip SRAM. This speeds up Transformer training and makes longer contexts practical, which in turn improves model quality.

Contribution

The paper presents FlashAttention, an IO-aware exact attention algorithm that uses tiling to reduce HBM accesses, analyzes its IO complexity, and extends it to block-sparse attention. The result is faster training and longer context handling.

Findings

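01. 15% end-to-end wall-clock speedup on BERT-large training (sequence length 512) over the MLPerf 1.1 training speed record

02. 3x speedup on GPT-2 training (sequence length 1K)

03. Enables Transformers to process sequences up to 64K tokens, reaching better-than-chance accuracy on Path-X (16K) and Path-256 (64K)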

Abstract

Transformers are slow and memory-hungry on long sequences, since the time and memory complexity of self-attention are quadratic in sequence length. Approximate attention methods have attempted to address this problem by trading off model quality to reduce the compute complexity, but often do not achieve wall-clock speedup. We argue that a missing principle is making attention algorithms IO-aware -- accounting for reads and writes between levels of GPU memory. We propose FlashAttention, an IO-aware exact attention algorithm that uses tiling to reduce the number of memory reads/writes between GPU high bandwidth memory (HBM) and GPU on-chip SRAM. We analyze the IO complexity of FlashAttention, showing that it requires fewer HBM accesses than standard attention, and is optimal for a range of SRAM sizes. We also extend FlashAttention to block-sparse attention, yielding an approximate…
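The tiling idea described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' fused CUDA kernel: it only tiles over key/value blocks, runs on the CPU, and does not model HBM or SRAM at all. The function names `reference_attention` and `tiled_attention` and the `block_size` parameter are hypothetical, chosen for illustration. What it does show is that attention can be computed block by block, with running row maxima and row sums, without ever materializing the full N x N score matrix, and that the result matches standard attention exactly (up to floating-point error).

```python
import numpy as np


def reference_attention(Q, K, V):
    """Standard attention: materializes the full N x N score matrix."""
    S = Q @ K.T / np.sqrt(Q.shape[-1])
    P = np.exp(S - S.max(axis=-1, keepdims=True))
    return (P / P.sum(axis=-1, keepdims=True)) @ V


def tiled_attention(Q, K, V, block_size=4):
    """Exact attention computed one key/value block at a time.

    Illustrative sketch only (hypothetical helper, not the paper's kernel).
    Only an N x block_size slice of scores exists at any moment; running
    row maxima and row sums let the softmax be rescaled incrementally.
    """
    N, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    out = np.zeros((N, V.shape[1]))        # unnormalized output accumulator
    row_max = np.full((N, 1), -np.inf)     # running max of scores per query
    row_sum = np.zeros((N, 1))             # running softmax denominator

    for start in range(0, K.shape[0], block_size):
        Kb = K[start:start + block_size]
        Vb = V[start:start + block_size]
        S = (Q @ Kb.T) * scale             # scores for this block only
        new_max = np.maximum(row_max, S.max(axis=-1, keepdims=True))
        correction = np.exp(row_max - new_max)   # rescales old accumulators
        P = np.exp(S - new_max)
        row_sum = row_sum * correction + P.sum(axis=-1, keepdims=True)
        out = out * correction + P @ Vb
        row_max = new_max

    return out / row_sum


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    Q, K, V = (rng.standard_normal((10, 8)) for _ in range(3))
    print(np.allclose(tiled_attention(Q, K, V, block_size=3),
                      reference_attention(Q, K, V)))
```

In a GPU implementation the point of this restructuring is that each block fits in fast on-chip memory, so the large intermediate matrix never has to be written to or read back from HBM.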

Code & Models

Datasets

WhiteAiZ/sd-webui-forge-classic
dataset · 634 dl

Videos

Flash Attention 2.0 with Tri Dao (author)! | Discord server talks · youtube

FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness · slideslive

Taxonomy

Topics: Advanced Neural Network Applications · Machine Learning and Data Classification · Domain Adaptation and Few-Shot Learning

Methods: Attention Is All You Need · Feedforward Network · Grouped-query attention · Multi-Query Attention · Rotary Position Embedding · Linear Layer · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · Layer Normalization · Softmax