Fast DistilBERT on CPUs

Haihao Shen, Ofir Zafrir, Bo Dong, Hengyu Meng, Xinyu Ye, Zhe Wang, Yi Ding, Hanwen Chang, Guy Boudoukh, and Moshe Wasserblat

arXiv:2211.07715 · cs.CL · December 8, 2022

TL;DR

This paper introduces a pipeline for creating and running fast Transformer models such as DistilBERT on CPUs, reporting up to 4.1x speedup over ONNX Runtime with minimal accuracy loss on SQuADv1.1.

Contribution

The authors present a novel CPU-optimized pipeline combining hardware-aware pruning, knowledge distillation, quantization, and a custom runtime engine for Transformer inference.
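As a rough illustration of two of the compression steps named above, the sketch below applies block-wise magnitude pruning followed by symmetric int8 quantization to a single weight matrix using NumPy. The function names, the 1x4 block shape, the 75% sparsity target, and the per-tensor quantization scheme are all assumptions chosen for illustration; the summary here does not specify the paper's actual sparsity pattern or quantization recipe.

```python
import numpy as np

def block_prune(weights, block=(1, 4), sparsity=0.75):
    """Zero the lowest-magnitude blocks (hypothetical helper, not the paper's code).

    Blocks are scored by their L1 norm; the `sparsity` fraction of blocks
    with the smallest scores is set to zero.
    """
    rows, cols = weights.shape
    br, bc = block
    assert rows % br == 0 and cols % bc == 0, "shape must be divisible by block"
    # View the matrix as a grid of (br x bc) blocks.
    blocks = weights.reshape(rows // br, br, cols // bc, bc)
    scores = np.abs(blocks).sum(axis=(1, 3))  # one score per block
    n_drop = int(round(sparsity * scores.size))
    keep = np.ones(scores.size, dtype=bool)
    if n_drop > 0:
        drop_idx = np.argsort(scores, axis=None, kind="stable")[:n_drop]
        keep[drop_idx] = False
    keep = keep.reshape(scores.shape)
    pruned = blocks * keep[:, None, :, None]
    return pruned.reshape(rows, cols), keep

def quantize_int8(weights):
    """Symmetric per-tensor int8 quantization (assumed scheme); returns (q, scale)."""
    max_abs = float(np.abs(weights).max())
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

rng = np.random.default_rng(0)
w = rng.normal(size=(8, 16)).astype(np.float32)

w_pruned, block_mask = block_prune(w, block=(1, 4), sparsity=0.75)
w_q, w_scale = quantize_int8(w_pruned)

# Dequantize to inspect the rounding error introduced by int8 storage.
w_restored = w_q.astype(np.float32) * w_scale
```

Pruning whole blocks rather than individual weights is what makes a sparsity pattern friendly to vectorized CPU kernels, since a kernel can skip or load an entire block at once. A real pipeline would prune gradually during training rather than in one shot as done here.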

Findings

01 Up to 4.1x speedup over ONNX Runtime.

02 Up to 50% better performance than Neural Magic's DeepSparse runtime.

03 Minimal accuracy loss on the SQuADv1.1 question-answering benchmark.

Abstract

Transformer-based language models have become the standard approach to solving natural language processing tasks. However, industry adoption usually requires the maximum throughput to comply with certain latency constraints that prevent Transformer models from being used in production. To address this gap, model compression techniques such as quantization and pruning may be used to improve inference efficiency. However, these compression techniques require specialized software to apply and deploy at scale. In this work, we propose a new pipeline for creating and running Fast Transformer models on CPUs, utilizing hardware-aware pruning, knowledge distillation, quantization, and our own Transformer inference runtime engine with optimized kernels for sparse and quantized operators. We demonstrate the efficiency of our pipeline by creating a Fast DistilBERT model showing minimal accuracy…
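Knowledge distillation, one of the techniques the abstract lists, is commonly formulated as a weighted sum of a temperature-softened KL term against a teacher model and a standard cross-entropy term against the labels. The NumPy sketch below shows that common formulation only; the function names, the `temperature` and `alpha` values, and the toy logits are assumptions for illustration, and the abstract does not state which loss the authors actually use.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over the last axis."""
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    """Generic distillation loss (assumed form, not taken from the paper).

    alpha weights the teacher-matching KL term; (1 - alpha) weights the
    cross-entropy against the ground-truth labels.
    """
    p_teacher = softmax(teacher_logits, temperature)
    log_p_student_soft = np.log(softmax(student_logits, temperature))
    # KL(teacher || student) on softened distributions, averaged over the batch.
    kl = (p_teacher * (np.log(p_teacher) - log_p_student_soft)).sum(axis=-1).mean()
    # Cross-entropy of the student against the hard labels at temperature 1.
    log_p_student = np.log(softmax(student_logits))
    ce = -log_p_student[np.arange(len(labels)), labels].mean()
    # The temperature**2 factor keeps the KL gradient scale comparable across temperatures.
    return alpha * temperature ** 2 * kl + (1 - alpha) * ce

# Toy batch of two examples with three classes (made-up values).
teacher = np.array([[2.0, 0.5, -1.0], [0.1, 1.5, 0.3]])
student = np.array([[1.0, 0.2, -0.5], [0.0, 0.5, 0.4]])
labels = np.array([0, 1])

loss = distillation_loss(student, teacher, labels)
```

When the student's logits equal the teacher's, the KL term vanishes and only the label cross-entropy remains, which is a quick sanity check for an implementation like this.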

Code & Models

Repositories

intel/intel-extension-for-transformers (PyTorch, official)

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications

Methods: Multi-Head Attention · Attention Is All You Need · Pruning · Linear Layer · Weight Decay · WordPiece · Attention Dropout · Dense Connections · Position-Wise Feed-Forward Layer · Label Smoothing