BLaST: High Performance Inference and Pretraining using BLock Sparse Transformers

Patrik Okanovic; Sameer Deshmukh; Grzegorz Kwasniewski; Yi Zhu; Haruto Fujii; Sakina Fatima; Maciej Besta; Kentaro Katayama; Takumi Honda; Yusuke Nagasaka; Torsten Hoefler

arXiv:2507.03117·cs.LG·October 28, 2025

Open Access

TL;DR

BLaST is a block-sparsification method for the linear layers of large-scale ML models. It cuts data movement and inference cost while keeping accuracy loss small: up to 95% sparsity in MLP weights, a 2.2x inference speedup, and a 4.45x smaller inference memory footprint.

Contribution

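The paper presents BLaST, a sparsification technique that iteratively prunes weight matrices into a block sparsity pattern suited to efficient sparse matrix-matrix (SpMM) multiplication. It reaches high sparsity with little accuracy loss while improving inference speed and memory footprint.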

Findings

1. Up to 95% sparsity in MLP weights, with accuracy loss below 2.25% in the majority of cases

2. 2.2x inference speedup on Llama 3.2 with 16 GPUs

3. 4.45x reduction in inference memory footprint

Abstract

The energy consumption of large-scale ML models is dominated by data movement, shuffling billions of parameters across memory hierarchies and data centers. Sparsification offers a principled way to mitigate these costs by pruning redundant weights and activations, thereby reducing data movement. Effective sparsification to prune redundant parameters is still challenging: existing methods incur significant accuracy degradation, performance overhead, or both. We introduce (Bl)ock (a)nd (S)parse (T)ransformers (BLaST), a general, robust, and reliable method for sparsification, applicable to linear layers in all settings. Our method iteratively sparsifies weight matrices into a block sparsity pattern suitable for efficient sparse matrix-matrix (SpMM) multiplication. BLaST achieves up to 95% sparsity in MLP weights with negligible accuracy loss (majority <2.25%). We show a 2.2x inference…
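To make the terms "block sparsity pattern" and "SpMM" in the abstract concrete, here is a minimal NumPy sketch. It is not the BLaST algorithm: the abstract describes an iterative procedure whose pruning criterion is not given here, so this sketch assumes a simple one-shot rule (drop the tiles with the smallest Frobenius norm). The function names `block_prune` and `block_spmm`, the block size, and the sparsity level are all illustrative choices.

```python
import numpy as np


def block_prune(W, block, sparsity):
    """Zero out the lowest-norm (block x block) tiles of W.

    Assumption: one-shot magnitude pruning, used only for illustration.
    Returns the pruned matrix and a boolean keep-mask over the tile grid.
    """
    rows, cols = W.shape
    assert rows % block == 0 and cols % block == 0
    br, bc = rows // block, cols // block
    # View W as a (br, bc) grid of (block, block) tiles.
    tiles = W.reshape(br, block, bc, block).transpose(0, 2, 1, 3)
    norms = np.linalg.norm(tiles, axis=(2, 3))  # Frobenius norm per tile
    n_drop = int(round(sparsity * br * bc))
    keep = np.ones(br * bc, dtype=bool)
    keep[np.argsort(norms, axis=None)[:n_drop]] = False
    keep = keep.reshape(br, bc)
    masked = tiles * keep[:, :, None, None]
    pruned = masked.transpose(0, 2, 1, 3).reshape(rows, cols)
    return pruned, keep


def block_spmm(W, keep, X, block):
    """Compute W @ X while reading only the tiles marked in keep."""
    Y = np.zeros((W.shape[0], X.shape[1]), dtype=np.result_type(W, X))
    for i, j in zip(*np.nonzero(keep)):
        r, c = i * block, j * block
        Y[r:r + block] += W[r:r + block, c:c + block] @ X[c:c + block]
    return Y


rng = np.random.default_rng(0)
W = rng.standard_normal((64, 64))   # stand-in for one linear-layer weight
X = rng.standard_normal((64, 8))    # stand-in for a batch of activations
W_sparse, keep = block_prune(W, block=8, sparsity=0.75)
Y = block_spmm(W_sparse, keep, X, block=8)
```

Because whole tiles are zero, the multiplication skips them entirely rather than testing individual elements, which is why a block pattern maps well onto dense hardware kernels. A real implementation would store only the kept tiles plus their indices instead of the full matrix.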

Taxonomy

Topics: Software System Performance and Reliability