An Efficient Sparse Inference Software Accelerator for Transformer-based Language Models on CPUs
Haihao Shen, Hengyu Meng, Bo Dong, Zhe Wang, Ofir Zafrir, Yi Ding, Yu Luo, Hanwen Chang, Qun Gao, Ziheng Wang, Guy Boudoukh, and Moshe Wasserblat

TL;DR
This paper introduces a highly efficient sparse inference software accelerator for Transformer-based language models on CPUs, leveraging structured pruning and Intel hardware features to significantly improve inference speed and support industrial latency requirements.
Contribution
The paper presents a novel sparse deep learning inference stack optimized for CPUs, with a new SpMM kernel that outperforms existing libraries and accelerates Transformer models under real-world constraints.
Findings
SpMM kernel outperforms existing sparse libraries by an order of magnitude.
Achieves up to 5x speedup over dense GEMM in industry-standard libraries.
Demonstrates up to 1.5x end-to-end inference speedup over Neural Magic's DeepSparse runtime on real Transformer models.
Abstract
In recent years, Transformer-based language models have become the standard approach for natural language processing tasks. However, stringent throughput and latency requirements in industrial applications are limiting their adoption. To mitigate the gap, model compression techniques such as structured pruning are being used to improve inference efficiency. However, most existing neural network inference runtimes lack adequate support for structured sparsity. In this paper, we propose an efficient sparse deep learning inference software stack for Transformer-based language models where the weights are pruned with constant block size. Our sparse software accelerator leverages Intel Deep Learning Boost to maximize the performance of sparse matrix - dense matrix multiplication (commonly abbreviated as SpMM) on CPUs. Our SpMM kernel outperforms the existing sparse libraries (oneMKL, TVM,…
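To make the idea of constant-block-size pruning and SpMM concrete, here is a minimal pure-Python sketch. It is not the paper's kernel: the function names, the magnitude-based block scoring, the block shape, and the list-of-blocks storage are all assumptions chosen for illustration, and nothing here models Intel Deep Learning Boost instructions. It only shows why a block-sparse weight lets the multiply skip pruned blocks entirely.

```python
# Illustrative sketch only. Names, block shape, and storage layout are
# hypothetical; this is not the kernel described in the paper.

def prune_blocks(weight, block_rows, block_cols, sparsity):
    """Keep the highest-magnitude blocks of `weight`; drop the rest.

    Returns a list of (row_offset, col_offset, block) for the kept blocks.
    """
    rows, cols = len(weight), len(weight[0])
    assert rows % block_rows == 0 and cols % block_cols == 0
    scored = []
    for br in range(0, rows, block_rows):
        for bc in range(0, cols, block_cols):
            block = [r[bc:bc + block_cols] for r in weight[br:br + block_rows]]
            score = sum(abs(v) for r in block for v in r)  # L1 norm of block
            scored.append((score, br, bc, block))
    n_keep = len(scored) - int(len(scored) * sparsity)
    kept = sorted(scored, key=lambda b: b[0], reverse=True)[:n_keep]
    return [(br, bc, block) for _, br, bc, block in kept]


def spmm(kept_blocks, dense, out_rows):
    """Block-sparse (out_rows x K) times dense (K x N), visiting kept blocks only."""
    n = len(dense[0])
    out = [[0.0] * n for _ in range(out_rows)]
    for br, bc, block in kept_blocks:
        for i, block_row in enumerate(block):
            out_row = out[br + i]
            for k, w in enumerate(block_row):
                dense_row = dense[bc + k]
                for j in range(n):
                    out_row[j] += w * dense_row[j]
    return out


def to_dense(kept_blocks, rows, cols):
    """Expand the kept blocks back into a dense matrix with zeros elsewhere."""
    full = [[0.0] * cols for _ in range(rows)]
    for br, bc, block in kept_blocks:
        for i, block_row in enumerate(block):
            for k, w in enumerate(block_row):
                full[br + i][bc + k] = w
    return full


def dense_matmul(a, b):
    """Reference dense multiply, used only to check the sparse result."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]
```

With a 50% sparsity target the inner loops in `spmm` touch half as many weights as `dense_matmul`; the regular block shape is what makes that skipped work easy to map onto vector instructions in a real kernel.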
Taxonomy
Topics: Topic Modeling · Advanced Neural Network Applications · Natural Language Processing Techniques
Methods: Lib · Multi-Head Attention · Attention Is All You Need · Pruning · Adam · Linear Layer · Residual Connection · Weight Decay · Softmax
