Accelerating Sparse Transformer Inference on GPU

Wenhao Dai; Haodong Deng; Mengfei Rong; Xinyu Yang; Hongyu Liu; Fangxin Liu; Hailong Yang; Qianwen Cao; Qingxiao Sun

arXiv:2506.06095·cs.LG·May 20, 2026

TL;DR

STOF is a GPU framework that optimizes sparse Transformer inference through flexible masking and operator fusion, achieving up to 1.6x speedup in MHA computation and up to 1.4x in end-to-end inference.

Contribution

The paper introduces STOF, a GPU framework that accelerates sparse Transformer inference by combining flexible masking in multi-head attention with operator fusion for downstream operators.

Findings

01. Achieves up to 1.6x speedup in MHA computation.

02. Achieves up to 1.4x speedup in end-to-end inference.

03. Adapts to diverse application scenarios.

Abstract

Large language models (LLMs) are popular worldwide due to their powerful understanding capabilities. The Transformer is the core component of LLMs, and accelerating it through parallelization has become an active research topic. Mask layers introduce sparsity into the Transformer and thereby reduce computation. However, previous work has rarely focused on optimizing the performance of sparse Transformers. In addition, current static operator fusion schemes fail to adapt to diverse application scenarios. To address these problems, we propose STOF, a framework of optimizations for Sparse Transformers that enables flexible masking and Operator Fusion on GPU. For the multi-head attention (MHA) structure, STOF maps the computation to row-wise or block-wise kernels with unique storage formats according to analytical modeling. For downstream operators, STOF maps the fusion scheme to compilation…
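
To make the idea of mask-induced sparsity concrete, here is a minimal pure-Python sketch of how an attention mask can be partitioned into tiles so that fully masked tiles need no computation. It is an illustration only, not STOF's kernels or storage formats: the sliding-window mask, the tile size, and the helper names `sliding_window_mask`, `classify_blocks`, and `skipped_fraction` are all assumptions introduced here.

```python
def sliding_window_mask(seq_len, window):
    """Build a seq_len x seq_len 0/1 mask (hypothetical example pattern).

    A 1 at [i][j] means query position i may attend to key position j.
    """
    return [[1 if abs(i - j) <= window else 0 for j in range(seq_len)]
            for i in range(seq_len)]


def classify_blocks(mask, block):
    """Label every block x block tile of the mask as 'empty', 'full', or 'partial'.

    'empty' tiles contain no allowed positions, so a block-wise kernel could
    skip them; 'full' tiles need no per-element masking; 'partial' tiles need both.
    """
    n = len(mask)
    labels = []
    for bi in range(0, n, block):
        row_labels = []
        for bj in range(0, n, block):
            rows = range(bi, min(bi + block, n))
            cols = range(bj, min(bj + block, n))
            allowed = sum(mask[i][j] for i in rows for j in cols)
            size = len(rows) * len(cols)
            if allowed == 0:
                row_labels.append("empty")
            elif allowed == size:
                row_labels.append("full")
            else:
                row_labels.append("partial")
        labels.append(row_labels)
    return labels


def skipped_fraction(labels):
    """Fraction of tiles that are entirely masked out."""
    flat = [label for row in labels for label in row]
    return flat.count("empty") / len(flat)


# Assumed toy sizes: 16 tokens, window of 2 on each side, 4x4 tiles.
mask = sliding_window_mask(seq_len=16, window=2)
labels = classify_blocks(mask, block=4)
frac = skipped_fraction(labels)
print(f"skippable tiles: {frac:.3f}")
```

With these toy sizes, only the diagonal tiles and their immediate neighbours contain allowed positions; the rest are empty. How much work a real kernel saves depends on the mask pattern and tile size, which is presumably what the abstract's analytical modeling weighs when choosing between row-wise and block-wise kernels.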

Taxonomy

Topics: Advanced Neural Network Applications · Generative Adversarial Networks and Image Synthesis · Parallel Computing and Optimization Techniques