Block Sparse Flash Attention

Daniel Ohayon; Itay Lamprecht; Itay Hubara; Israel Cohen; Daniel Soudry; Noam Elata

arXiv:2512.07011 · cs.LG · December 9, 2025

TL;DR

Block-Sparse FlashAttention (BSFA) accelerates long-context inference in large language models by computing exact query-key scores and then skipping the value blocks whose scores fall below calibrated thresholds. This avoids about 50% of the computation and memory transfers associated with pruned blocks while keeping accuracy above 99% of the baseline.

Contribution

BSFA introduces a training-free block-sparse attention mechanism that selects value blocks from exact query-key similarities rather than predicted importance. It speeds up inference while preserving model quality and outperforms existing sparse attention methods.

Findings

01  Up to 1.10x speedup on Llama-3.1-8B benchmarks.

02  Maintains above 99% of baseline accuracy.

03  Substantially outperforms existing sparse attention methods.

Abstract

Modern large language models increasingly require long contexts for reasoning and multi-document tasks, but attention's quadratic complexity creates a severe computational bottleneck. We present Block-Sparse FlashAttention (BSFA), a drop-in replacement that accelerates long-context inference while preserving model quality. Unlike methods that predict importance before computing scores, BSFA computes exact query-key similarities to select the top-k most important value blocks for each query. By comparing per-block maximum scores against calibrated thresholds, we skip approximately 50% of the computation and memory transfers for pruned blocks. Our training-free approach requires only a one-time threshold calibration on a small dataset to learn the per-layer and per-head attention score distributions. We provide a CUDA kernel implementation that can be used as a drop-in replacement for…
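The selection rule described in the abstract can be illustrated with a minimal pure-Python sketch for a single query vector. It is not the authors' CUDA kernel: the function names `calibrate_threshold` and `block_sparse_attention_row`, the quantile-based calibration rule, and the use of one absolute threshold on the scaled score are all assumptions made for illustration. The sketch computes exact scores for every key block, skips the matching value block when the block's maximum score is below the threshold, and accumulates the surviving blocks with an online softmax in the style of FlashAttention.

```python
import math


def calibrate_threshold(block_maxima, keep_fraction):
    """Hypothetical calibration rule: choose a threshold so that roughly
    `keep_fraction` of the observed per-block maxima lie at or above it."""
    ordered = sorted(block_maxima)
    cut = int((1.0 - keep_fraction) * len(ordered))
    return ordered[min(cut, len(ordered) - 1)]


def block_sparse_attention_row(q, keys, values, block_size, threshold):
    """Attention output for one query, skipping value blocks whose maximum
    scaled score is below `threshold`. Returns (output, skipped_blocks)."""
    scale = 1.0 / math.sqrt(len(q))
    running_max = -math.inf            # online-softmax running maximum
    denom = 0.0                        # online-softmax running denominator
    acc = [0.0] * len(values[0])       # unnormalized weighted sum of values
    skipped = 0

    for start in range(0, len(keys), block_size):
        k_blk = keys[start:start + block_size]
        # Exact query-key scores for this block (no importance prediction).
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in k_blk]
        blk_max = max(scores)

        if blk_max < threshold:
            # Pruned block: its values are never read or multiplied.
            skipped += 1
            continue

        v_blk = values[start:start + block_size]
        new_max = max(running_max, blk_max)
        # Rescale earlier partial sums to the new maximum; exp(-inf) is 0.0.
        rescale = math.exp(running_max - new_max)
        denom *= rescale
        acc = [a * rescale for a in acc]
        for s, v in zip(scores, v_blk):
            w = math.exp(s - new_max)
            denom += w
            acc = [a + w * x for a, x in zip(acc, v)]
        running_max = new_max

    if denom == 0.0:
        raise ValueError("threshold pruned every block")
    return [a / denom for a in acc], skipped
```

Passing `-math.inf` as the threshold keeps every block and reproduces dense softmax attention, which makes it easy to compare sparse and dense outputs on the same inputs. In a real kernel the saving comes from not loading pruned value blocks from memory; this sketch only counts them.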

Taxonomy

Topics: Topic Modeling · Multimodal Machine Learning Applications · Natural Language Processing Techniques