Block Sparse Flash Attention
Daniel Ohayon, Itay Lamprecht, Itay Hubara, Israel Cohen, Daniel Soudry, Noam Elata

TL;DR
Block-Sparse FlashAttention (BSFA) accelerates long-context inference in large language models by computing exact query-key scores and skipping the value blocks that matter least, pruning roughly 50% of the computation and memory transfers while keeping accuracy above 99% of the baseline.
Contribution
BSFA introduces a training-free block-sparse attention mechanism that uses exact query-key scores, rather than predicted importance, to decide which value blocks to skip. It speeds up inference while maintaining or improving model accuracy, and it outperforms existing sparse attention methods.
Findings
Up to 1.10x speedup on Llama-3.1-8B benchmarks.
Maintains above 99% of baseline accuracy.
Substantially outperforms existing sparse attention methods.
Abstract
Modern large language models increasingly require long contexts for reasoning and multi-document tasks, but attention's quadratic complexity creates a severe computational bottleneck. We present Block-Sparse FlashAttention (BSFA), a drop-in replacement that accelerates long-context inference while preserving model quality. Unlike methods that predict importance before computing scores, BSFA computes exact query-key similarities to select the top-k most important value blocks for each query. By comparing per-block maximum scores against calibrated thresholds, we skip approximately 50% of the computation and memory transfers for pruned blocks. Our training-free approach requires only a one-time threshold calibration on a small dataset to learn the per-layer and per-head attention score distributions. We provide a CUDA kernel implementation that can be used as a drop-in replacement for…
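The pruning rule described in the abstract can be illustrated with a minimal pure-Python sketch for a single query row. This is not the authors' CUDA kernel: the function name `block_sparse_attention_row`, the single scalar `threshold`, and the comparison against the scaled score are assumptions made for illustration, and the per-layer, per-head calibration step is not shown. The sketch computes exact scores for every key block, compares the block maximum with the threshold, and only reads the value block when the test passes, using a FlashAttention-style running softmax.

```python
import math


def block_sparse_attention_row(q, keys, values, block_size, threshold):
    """Attention output for one query, skipping key/value blocks whose
    maximum scaled score is below `threshold` (a hypothetical scalar here;
    the source describes calibrated per-layer, per-head thresholds)."""
    scale = 1.0 / math.sqrt(len(q))
    running_max = -math.inf          # running max for a stable softmax
    running_sum = 0.0                # running softmax denominator
    acc = [0.0] * len(values[0])     # running unnormalized output
    blocks_used = 0

    for start in range(0, len(keys), block_size):
        k_block = keys[start:start + block_size]
        # Exact query-key scores for this block are always computed.
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in k_block]
        block_max = max(scores)
        if block_max < threshold:
            continue  # pruned: the value block is never read
        blocks_used += 1

        # Online-softmax update: rescale earlier partial results to the new max.
        new_max = max(running_max, block_max)
        rescale = math.exp(running_max - new_max)  # 0.0 for the first kept block
        running_sum *= rescale
        acc = [a * rescale for a in acc]
        for s, v in zip(scores, values[start:start + block_size]):
            w = math.exp(s - new_max)
            running_sum += w
            acc = [a + w * vi for a, vi in zip(acc, v)]
        running_max = new_max

    if blocks_used == 0:
        raise ValueError("threshold pruned every block")
    return [a / running_sum for a in acc], blocks_used


# Toy example: the second block scores far below the first, so it is skipped.
keys = [[4.0, 0.0], [3.0, 0.0], [-4.0, 0.0], [-5.0, 0.0]]
values = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
sparse_out, sparse_used = block_sparse_attention_row([1.0, 0.0], keys, values, 2, 0.0)
dense_out, dense_used = block_sparse_attention_row([1.0, 0.0], keys, values, 2, -math.inf)
```

Setting the threshold to negative infinity keeps every block and recovers dense attention, which gives a simple way to check how much the pruned output deviates from the unpruned one.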
Taxonomy
Topics: Topic Modeling · Multimodal Machine Learning Applications · Natural Language Processing Techniques
