Efficient 8-Bit Quantization of Transformer Neural Machine Language Translation Model
Aishwarya Bhandare, Vamsi Sripathi, Deepthi Karkada, Vivek Menon, Sun Choi, Kushal Datta, Vikram Saletore

TL;DR
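This paper quantizes a trained Transformer language translation model to INT8 using the VNNI instructions on Intel Xeon Cascade Lake CPUs, reporting a 1.5x speedup over the best FP32 performance with less than 0.5 drop in accuracy. It introduces quantization techniques implemented directly in TensorFlow and a bin-packing parallel batching method.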
Contribution
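To the best of the authors' knowledge, it is the first industry attempt to quantize the Transformer model. It presents new techniques for converting FP32 computations to INT8 in the TensorFlow graph and a bin-packing parallel batching technique to maximize CPU utilization.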
Findings
1. Achieved a 1.5x speedup over the best FP32 performance.
2. Maintained less than 0.5 drop in accuracy.
3. Demonstrated effective INT8 quantization for Transformer models.
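To make the FP32-to-INT8 idea concrete, here is a minimal NumPy sketch of symmetric per-tensor quantization applied to a matrix multiply. It is an illustration only: the scaling scheme, the function names (quantize_int8, dequantize, int8_matmul), and the random test data are assumptions for this example, not the authors' TensorFlow implementation or their calibration method.

```python
import numpy as np

def quantize_int8(x):
    """Symmetric per-tensor quantization (an assumed scheme, for illustration):
    map the range [-max|x|, max|x|] onto the integers [-127, 127]."""
    max_abs = float(np.max(np.abs(x)))
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover an FP32 approximation of the original tensor."""
    return q.astype(np.float32) * scale

def int8_matmul(a, b):
    """Quantize both operands, multiply with 32-bit integer accumulation,
    then rescale the result back to FP32."""
    qa, sa = quantize_int8(a)
    qb, sb = quantize_int8(b)
    acc = qa.astype(np.int32) @ qb.astype(np.int32)
    return acc.astype(np.float32) * (sa * sb)

# Hypothetical data standing in for an activation and a weight matrix.
rng = np.random.default_rng(0)
a = rng.standard_normal((4, 64)).astype(np.float32)
b = rng.standard_normal((64, 8)).astype(np.float32)

q, scale = quantize_int8(a)
roundtrip_err = float(np.max(np.abs(dequantize(q, scale) - a)))
matmul_err = float(np.max(np.abs(int8_matmul(a, b) - a @ b)))
print(f"round-trip error: {roundtrip_err:.5f}, matmul error: {matmul_err:.5f}")
```

The round-trip error of each element is bounded by half the scale, which is why the choice of range (and hence scale) matters for accuracy.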
Abstract
In this work, we quantize a trained Transformer machine language translation model leveraging INT8/VNNI instructions in the latest Intel Xeon Cascade Lake processors to improve inference performance while maintaining less than 0.5 drop in accuracy. To the best of our knowledge, this is the first attempt in the industry to quantize the Transformer model. This has high impact as it clearly demonstrates the various complexities of quantizing the language translation model. We present novel quantization techniques directly in TensorFlow to opportunistically replace 32-bit floating point (FP32) computations with 8-bit integers (INT8) and transform the FP32 computational graph. We also present a bin-packing parallel batching technique to maximize CPU utilization. Overall, our optimizations with INT8/VNNI deliver 1.5X improvement over the best FP32 performance.…
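The abstract does not describe how the bin-packing batching works, so the sketch below shows only the general idea under stated assumptions: sentences of different token lengths are packed into batches with a fixed token budget using the classic first-fit-decreasing heuristic. The function name pack_batches, the token_budget parameter, and the sample lengths are hypothetical and are not taken from the paper.

```python
def pack_batches(lengths, token_budget):
    """First-fit-decreasing bin packing (an assumed heuristic, for illustration).

    lengths: token count of each sentence.
    token_budget: maximum total tokens allowed in one batch.
    Returns a list of batches, each a list of sentence indices.
    """
    # Visit sentences from longest to shortest.
    order = sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True)
    batches = []  # sentence indices per batch
    loads = []    # tokens already placed in each batch
    for i in order:
        n = lengths[i]
        if n > token_budget:
            raise ValueError(f"sentence {i} has {n} tokens, over the budget")
        for b, load in enumerate(loads):
            if load + n <= token_budget:
                # Fits in an existing batch: place it there.
                batches[b].append(i)
                loads[b] += n
                break
        else:
            # No existing batch has room: open a new one.
            batches.append([i])
            loads.append(n)
    return batches

# Hypothetical sentence lengths in tokens.
lengths = [12, 40, 7, 33, 25, 5, 18]
batches = pack_batches(lengths, token_budget=50)
for b in batches:
    print(b, sum(lengths[i] for i in b))
```

Packing this way keeps batches close to the budget, so independent batches handed to parallel workers carry similar amounts of work.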
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Advanced Neural Network Applications
Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Residual Connection · Byte Pair Encoding · Dense Connections · Label Smoothing · Adam · Softmax
