Fast DistilBERT on CPUs
Haihao Shen, Ofir Zafrir, Bo Dong, Hengyu Meng, Xinyu Ye, Zhe Wang, Yi Ding, Hanwen Chang, Guy Boudoukh, and Moshe Wasserblat

TL;DR
This paper introduces a pipeline for creating and running fast Transformer models such as DistilBERT on CPUs, reporting up to 4.1x speedup over ONNX Runtime with minimal accuracy loss on SQuADv1.1.
Contribution
The authors present a novel CPU-optimized pipeline combining hardware-aware pruning, knowledge distillation, quantization, and a custom runtime engine for Transformer inference.
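To make two of these stages concrete, here is a minimal NumPy sketch of block-wise magnitude pruning followed by symmetric INT8 quantization of a single weight matrix. It is an illustration under stated assumptions, not the authors' implementation: the function names, the 4x1 block shape, the 75% sparsity target, and the per-tensor quantization scheme are all chosen here for the example.

```python
import numpy as np

def block_magnitude_prune(weights, block_shape=(4, 1), sparsity=0.75):
    """Zero the lowest-magnitude blocks (illustrative; block shape is an assumption)."""
    rows, cols = weights.shape
    br, bc = block_shape
    assert rows % br == 0 and cols % bc == 0, "matrix must tile evenly into blocks"
    # View the matrix as a grid of blocks: (rows/br, br, cols/bc, bc).
    blocks = weights.reshape(rows // br, br, cols // bc, bc)
    # Score each block by its L1 norm.
    scores = np.abs(blocks).sum(axis=(1, 3))
    n_prune = int(round(sparsity * scores.size))
    # Mark the n_prune lowest-scoring blocks for removal.
    mask = np.ones(scores.size, dtype=bool)
    mask[np.argsort(scores, axis=None, kind="stable")[:n_prune]] = False
    mask = mask.reshape(scores.shape)
    pruned = blocks * mask[:, None, :, None]
    return pruned.reshape(rows, cols), mask

def quantize_int8_symmetric(weights):
    """Per-tensor symmetric quantization to int8 (an assumed, simple scheme)."""
    max_abs = float(np.abs(weights).max())
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Map int8 values back to float32 for comparison with the original."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(size=(16, 8)).astype(np.float32)
w_pruned, block_mask = block_magnitude_prune(w)
w_q, w_scale = quantize_int8_symmetric(w_pruned)
w_restored = dequantize(w_q, w_scale)
```

Pruning whole blocks rather than individual weights keeps the zeros in a regular pattern, which is what lets a sparse kernel skip work; the quantization step then stores the surviving weights in 8 bits, and pruned entries remain exactly zero.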
Findings
Up to 4.1x speedup over ONNX Runtime.
Outperforms Neural Magic's DeepSparse runtime by up to 50%.
Minimal accuracy loss on the SQuADv1.1 benchmark.
Abstract
Transformer-based language models have become the standard approach to solving natural language processing tasks. However, industry adoption usually requires maximum throughput under certain latency constraints, which prevents Transformer models from being used in production. To address this gap, model compression techniques such as quantization and pruning may be used to improve inference efficiency. However, these compression techniques require specialized software to apply and deploy at scale. In this work, we propose a new pipeline for creating and running Fast Transformer models on CPUs, utilizing hardware-aware pruning, knowledge distillation, quantization, and our own Transformer inference runtime engine with optimized kernels for sparse and quantized operators. We demonstrate the efficiency of our pipeline by creating a Fast DistilBERT model showing minimal accuracy…
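The abstract names knowledge distillation without giving its loss on this page. As a hedged illustration, the sketch below computes the commonly used soft-target loss: the KL divergence between temperature-softened teacher and student distributions, scaled by the squared temperature. The function names, the temperature value, and the example logits are assumptions made for this example, and the paper may use a different formulation.

```python
import numpy as np

def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax over the last axis."""
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Mean KL(teacher || student) on softened distributions, scaled by T^2."""
    log_p_teacher = log_softmax(teacher_logits, temperature)
    log_p_student = log_softmax(student_logits, temperature)
    p_teacher = np.exp(log_p_teacher)
    kl_per_example = np.sum(p_teacher * (log_p_teacher - log_p_student), axis=-1)
    return float(kl_per_example.mean() * temperature ** 2)

# Hypothetical logits for a batch of two examples and three classes.
teacher = np.array([[2.0, 0.5, -1.0], [0.1, 1.5, 0.3]])
student_matching = teacher.copy()
student_uniform = np.zeros_like(teacher)

loss_matching = distillation_loss(student_matching, teacher)
loss_uniform = distillation_loss(student_uniform, teacher)
```

The loss is zero when the student reproduces the teacher's distribution and grows as the two diverge, so minimizing it during pruning pushes the compressed model back toward the dense model's behavior.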
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications
Methods: Multi-Head Attention · Attention Is All You Need · Pruning · Linear Layer · Weight Decay · WordPiece · Attention Dropout · Dense Connections · Position-Wise Feed-Forward Layer · Label Smoothing
