BLASST: Dynamic BLocked Attention Sparsity via Softmax Thresholding

Jiayi Yuan; Cameron Shinn; Kai Xu; Jingze Cui; George Klimiashvili; Guangxuan Xiao; Perkz Zheng; Bo Li; Yuxin Zhou; Zhouhai Ye; Weijie You; Tian Zheng; Dominic Brown; Pengbo Wang; Markus Hoehnerbach; Richard Cai; Julien Demouth; John D. Owens; Xia Hu; Song Han; Timmy Liu; Huizi Mao

arXiv:2512.12087 · cs.CL · April 29, 2026

TL;DR

BLASST introduces a dynamic sparse attention mechanism for LLMs that accelerates inference by skipping negligible attention blocks using a simple threshold, achieving significant speedups without retraining.

Contribution

BLASST provides a practical, hardware-friendly, and easy-to-integrate sparse attention method that speeds up LLM inference without additional training or pre-computation.

Findings

1. Achieves up to 1.52x speedup in prefill at 71.9% sparsity.

2. Achieves up to 1.48x speedup in decode at 73.2% sparsity.

3. Maintains benchmark accuracy while significantly reducing computation.

Abstract

The growing demand for long-context inference capabilities in Large Language Models (LLMs) has intensified the computational and memory bottlenecks inherent to the self-attention mechanism. To address this challenge, we introduce BLASST, a drop-in, dynamic sparse attention mechanism that accelerates inference by using only a fixed scalar threshold to skip attention blocks. Our method targets practical inference deployment by removing the barriers to adoption present in existing works. As such, BLASST eliminates training requirements, avoids expensive pre-computation passes, accelerates both prefill and decode across all major attention variants (MHA, GQA, MQA, and MLA), provides optimized support for modern hardware, and easily integrates into existing frameworks. This is achieved by reusing online softmax statistics to identify negligible attention scores, skipping softmax, value block…
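The block-skipping idea described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' kernel: the abstract is truncated before the exact skip rule is stated, so the criterion used here (skip a key/value block when its maximum score falls below the running online-softmax maximum by more than ln(threshold)) is an assumption, and the function names, the single-query setup, and the default `block_size` and `threshold` are hypothetical choices made for illustration.

```python
import numpy as np


def dense_attention(q, k, v):
    """Reference softmax attention for one query vector q of shape (d,)."""
    s = k @ q / np.sqrt(q.shape[-1])
    p = np.exp(s - s.max())
    return (p @ v) / p.sum()


def blocked_attention_with_skipping(q, k, v, block_size=4, threshold=1e-3):
    """Online-softmax attention over key/value blocks with threshold skipping.

    Assumed skip rule (not confirmed by the source): a block is skipped when
    block_max - running_max < ln(threshold), i.e. every weight in the block
    would be smaller than `threshold` relative to the largest weight so far.
    Returns the attention output and the number of skipped blocks.
    """
    d = q.shape[-1]
    # threshold == 0 disables skipping entirely.
    log_thr = np.log(threshold) if threshold > 0 else -np.inf

    m = -np.inf                      # running maximum score
    l = 0.0                          # running softmax denominator
    acc = np.zeros(v.shape[-1])      # running weighted sum of value rows
    skipped = 0

    for start in range(0, k.shape[0], block_size):
        s = k[start:start + block_size] @ q / np.sqrt(d)
        m_blk = s.max()

        # Reuse the statistic online softmax already tracks: if this block
        # is negligible, skip the exponentials and the value-block product.
        if m_blk - m < log_thr:
            skipped += 1
            continue

        m_new = max(m, m_blk)
        scale = np.exp(m - m_new)    # rescale earlier partial results
        p = np.exp(s - m_new)
        l = l * scale + p.sum()
        acc = acc * scale + p @ v[start:start + block_size]
        m = m_new

    return acc / l, skipped
```

With `threshold=0.0` no block is ever skipped and the result matches dense attention up to floating-point error. With a positive threshold, the skipped blocks are those whose contribution to the softmax sum is bounded by the threshold relative to the running maximum, which is why the output stays close to the dense result in this sketch.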
