An Efficient Sparse Inference Software Accelerator for Transformer-based Language Models on CPUs

Haihao Shen, Hengyu Meng, Bo Dong, Zhe Wang, Ofir Zafrir, Yi Ding, Yu Luo, Hanwen Chang, Qun Gao, Ziheng Wang, Guy Boudoukh, and Moshe Wasserblat

arXiv:2306.16601 · cs.LG · June 30, 2023

Open Access · 1 Repo

TL;DR

This paper introduces a highly efficient sparse inference software accelerator for Transformer-based language models on CPUs, leveraging structured pruning and Intel hardware features to significantly improve inference speed and support industrial latency requirements.

Contribution

The paper presents a novel sparse deep learning inference stack optimized for CPUs, with a new SpMM kernel that outperforms existing libraries and accelerates Transformer models under real-world constraints.

Findings

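1. The SpMM kernel outperforms existing sparse libraries, such as oneMKL and TVM, by an order of magnitude.
2. The SpMM kernel achieves up to 5x speedup over the dense GEMM kernels in industry-standard libraries.
3. The full sparse inference stack demonstrates up to 1.5x speedup over a competing neural network inference runtime on real Transformer-based language models.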

Abstract

In recent years, Transformer-based language models have become the standard approach for natural language processing tasks. However, stringent throughput and latency requirements in industrial applications are limiting their adoption. To close the gap, model compression techniques such as structured pruning are being used to improve inference efficiency. Yet most existing neural network inference runtimes lack adequate support for structured sparsity. In this paper, we propose an efficient sparse deep learning inference software stack for Transformer-based language models where the weights are pruned with constant block size. Our sparse software accelerator leverages Intel Deep Learning Boost to maximize the performance of sparse matrix-dense matrix multiplication (commonly abbreviated as SpMM) on CPUs. Our SpMM kernel outperforms the existing sparse libraries (oneMKL, TVM,…
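To make the idea of weights "pruned with constant block size" concrete, the following is a minimal pure-Python sketch of block-wise magnitude pruning followed by a block-sparse SpMM that only touches the surviving blocks. It is an illustration under stated assumptions, not the paper's kernel: the function names (`prune_blocks`, `to_blocks`, `spmm`), the L1-norm pruning criterion, and the block dimensions passed as `bh` and `bw` are all hypothetical choices made here, and the sketch models none of the Intel Deep Learning Boost instructions, quantization, or code generation the paper relies on.

```python
from typing import List, Tuple

Matrix = List[List[float]]
Block = Tuple[int, int, Matrix]  # (row offset, column offset, block values)


def prune_blocks(weight: Matrix, bh: int, bw: int, sparsity: float) -> Matrix:
    """Zero the lowest-L1-norm (bh x bw) blocks so that roughly
    `sparsity` of all blocks become zero. Criterion is an assumption."""
    rows, cols = len(weight), len(weight[0])
    assert rows % bh == 0 and cols % bw == 0, "shape must divide by block size"
    scored = []
    for r in range(0, rows, bh):
        for c in range(0, cols, bw):
            l1 = sum(abs(weight[r + i][c + j])
                     for i in range(bh) for j in range(bw))
            scored.append((l1, r, c))
    scored.sort()  # smallest-magnitude blocks first
    n_prune = int(len(scored) * sparsity)
    pruned = [row[:] for row in weight]  # leave the input untouched
    for _, r, c in scored[:n_prune]:
        for i in range(bh):
            for j in range(bw):
                pruned[r + i][c + j] = 0.0
    return pruned


def to_blocks(weight: Matrix, bh: int, bw: int) -> List[Block]:
    """Store only the blocks that contain at least one nonzero value."""
    blocks = []
    for r in range(0, len(weight), bh):
        for c in range(0, len(weight[0]), bw):
            vals = [weight[r + i][c:c + bw] for i in range(bh)]
            if any(v != 0.0 for row in vals for v in row):
                blocks.append((r, c, vals))
    return blocks


def spmm(blocks: List[Block], n_rows: int, dense: Matrix) -> Matrix:
    """Compute (block-sparse weight) @ dense, skipping pruned blocks."""
    n_cols = len(dense[0])
    out = [[0.0] * n_cols for _ in range(n_rows)]
    for r, c, vals in blocks:
        for i, block_row in enumerate(vals):
            for j, w in enumerate(block_row):
                for k in range(n_cols):
                    out[r + i][k] += w * dense[c + j][k]
    return out


if __name__ == "__main__":
    weight = [[1.0, 2.0, 0.1, 0.1],
              [3.0, 4.0, 0.1, 0.1],
              [0.2, 0.2, 5.0, 6.0],
              [0.2, 0.2, 7.0, 8.0]]
    activations = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.0]]
    pruned = prune_blocks(weight, bh=2, bw=2, sparsity=0.5)
    blocks = to_blocks(pruned, bh=2, bw=2)
    print(len(blocks), "blocks kept")
    print(spmm(blocks, n_rows=4, dense=activations))
```

Because every pruned region has the same fixed shape, the multiplication loop skips whole blocks rather than testing individual weights. That regularity is what makes structured sparsity easier to map onto vector instructions than unstructured sparsity.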

Code & Models

Repositories

intel/intel-extension-for-transformers
PyTorch · Official

Taxonomy

Topics: Topic Modeling · Advanced Neural Network Applications · Natural Language Processing Techniques

Methods: Lib · Multi-Head Attention · Attention Is All You Need · Pruning · Adam · Linear Layer · Residual Connection · Weight Decay · Softmax