Efficiently Dispatching Flash Attention For Partially Filled Attention Masks
Agniv Sharma, Jonas Geiping

TL;DR
This paper introduces Binary Block Masking, a modification that makes Flash Attention mask-aware so it can handle sparse and partially filled attention masks efficiently, with up to a 9x runtime improvement on masks derived from real-world scenarios.
Contribution
The paper presents a mask-aware modification to Flash Attention and two further optimizations: one for masks with contiguous non-zero patterns and one for extremely sparse masks.
Findings
Up to 9x runtime improvement on real-world masks
Binary Block Masking effectively handles sparse and partially filled attention matrices
Implementation will be publicly released
Abstract
Transformers are widely used across various applications, many of which yield sparse or partially filled attention matrices. Examples include attention masks designed to reduce the quadratic complexity of attention, sequence packing techniques, and recent innovations like tree masking for fast validation in MEDUSA. Despite the inherent sparsity in these matrices, the state-of-the-art algorithm Flash Attention still processes them with quadratic complexity as though they were dense. In this paper, we introduce Binary Block Masking, a highly efficient modification that enhances Flash Attention by making it mask-aware. We further propose two optimizations: one tailored for masks with contiguous non-zero patterns and another for extremely sparse masks. Our experiments on attention masks derived from real-world scenarios demonstrate up to a 9x runtime improvement. The implementation will be…
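The following is a minimal NumPy sketch of the general idea described in the abstract, not the authors' implementation. It assumes that "mask-aware" means precomputing one bit per tile of the attention mask (set when the tile contains any non-zero entry) and skipping tiles whose bit is unset. The function names `packed_causal_mask`, `binary_block_mask`, `dense_masked_attention`, and `block_sparse_attention`, as well as the block sizes, are hypothetical and chosen only for illustration. The sequence-packing mask is used as the example because the abstract names it as one source of partially filled masks.

```python
import numpy as np


def packed_causal_mask(lengths):
    """Causal mask for several sequences packed into one row (block diagonal)."""
    total = sum(lengths)
    mask = np.zeros((total, total), dtype=bool)
    start = 0
    for length in lengths:
        end = start + length
        mask[start:end, start:end] = np.tril(np.ones((length, length), dtype=bool))
        start = end
    return mask


def binary_block_mask(mask, block_m, block_n):
    """One bit per (block_m x block_n) tile: True if the tile has any non-zero."""
    m, n = mask.shape
    rows = -(-m // block_m)  # ceiling division
    cols = -(-n // block_n)
    padded = np.zeros((rows * block_m, cols * block_n), dtype=bool)
    padded[:m, :n] = mask != 0
    return padded.reshape(rows, block_m, cols, block_n).any(axis=(1, 3))


def dense_masked_attention(q, k, v, mask):
    """Reference: full quadratic attention (every row needs one unmasked key)."""
    s = (q @ k.T) / np.sqrt(q.shape[1])
    s = np.where(mask != 0, s, -np.inf)
    p = np.exp(s - s.max(axis=1, keepdims=True))
    return (p / p.sum(axis=1, keepdims=True)) @ v


def block_sparse_attention(q, k, v, mask, block_m=2, block_n=2):
    """Tiled attention with an online softmax that visits only non-empty tiles."""
    m, d = q.shape
    n = k.shape[0]
    bbm = binary_block_mask(mask, block_m, block_n)
    out = np.zeros((m, v.shape[1]))
    scale = 1.0 / np.sqrt(d)
    for i in range(bbm.shape[0]):
        r0, r1 = i * block_m, min((i + 1) * block_m, m)
        row_max = np.full(r1 - r0, -np.inf)
        row_sum = np.zeros(r1 - r0)
        acc = np.zeros((r1 - r0, v.shape[1]))
        for j in np.flatnonzero(bbm[i]):  # empty tiles are never touched
            c0, c1 = j * block_n, min((j + 1) * block_n, n)
            s = (q[r0:r1] @ k[c0:c1].T) * scale
            s = np.where(mask[r0:r1, c0:c1] != 0, s, -np.inf)
            new_max = np.maximum(row_max, s.max(axis=1))
            # Rows with no unmasked key yet keep a max of -inf; use 0 to avoid NaN.
            safe_max = np.where(np.isfinite(new_max), new_max, 0.0)
            p = np.exp(s - safe_max[:, None])
            correction = np.exp(row_max - safe_max)  # rescale earlier partial sums
            row_sum = row_sum * correction + p.sum(axis=1)
            acc = acc * correction[:, None] + p @ v[c0:c1]
            row_max = new_max
        valid = row_sum > 0  # fully masked rows stay zero
        out[r0:r1][valid] = acc[valid] / row_sum[valid, None]
    return out


if __name__ == "__main__":
    mask = packed_causal_mask([3, 5])
    bbm = binary_block_mask(mask, 2, 2)
    print(bbm.astype(int))
    print("tiles visited:", int(bbm.sum()), "of", bbm.size)
```

In this sketch the saving comes entirely from the inner loop iterating over `np.flatnonzero(bbm[i])` instead of every key tile. A GPU kernel would make the same decision per thread block; how the paper's two further optimizations reorganize that loop is not covered here.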
Taxonomy
Topics: Visual Attention and Saliency Detection · CCD and CMOS Imaging Sensors · Advanced Memory and Neural Computing
Methods: Softmax · Attention Is All You Need
