Efficient 8-Bit Quantization of Transformer Neural Machine Language Translation Model

Aishwarya Bhandare; Vamsi Sripathi; Deepthi Karkada; Vivek Menon; Sun; Choi; Kushal Datta; Vikram Saletore

arXiv:1906.00532·cs.LG·June 10, 2019·57 cites

Open Access

TL;DR

This paper demonstrates how to quantize a Transformer language translation model to INT8 on Intel CPUs, achieving significant speedup with minimal accuracy loss, and introduces novel TensorFlow techniques and batching methods.

Contribution

To the authors' knowledge, this is the first industry attempt to quantize the Transformer model. The paper presents new techniques for INT8 conversion in TensorFlow and a batching method that improves CPU utilization.

Findings

1. Achieved 1.5x speedup over FP32 performance.

2. Maintained less than 0.5% accuracy drop.

3. Demonstrated effective INT8 quantization for Transformer models.

Abstract

In this work, we quantize a trained Transformer machine language translation model leveraging INT8/VNNI instructions in the latest Intel® Xeon® Cascade Lake processors to improve inference performance while maintaining less than 0.5% drop in accuracy. To the best of our knowledge, this is the first attempt in the industry to quantize the Transformer model. This has high impact as it clearly demonstrates the various complexities of quantizing the language translation model. We present novel quantization techniques directly in TensorFlow to opportunistically replace 32-bit floating point (FP32) computations with 8-bit integers (INT8) and transform the FP32 computational graph. We also present a bin-packing parallel batching technique to maximize CPU utilization. Overall, our optimizations with INT8/VNNI deliver 1.5X improvement over the best FP32 performance.…
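As a rough illustration of what replacing an FP32 computation with INT8 involves, the sketch below quantizes two vectors with a single symmetric per-tensor scale, accumulates their dot product in integers, and rescales the result to floating point. The abstract does not spell out the paper's quantization scheme, so the symmetric scaling and the function names `quantize_symmetric` and `int8_dot` are assumptions made for illustration only, not the authors' TensorFlow implementation.

```python
from typing import List, Tuple


def quantize_symmetric(values: List[float], num_bits: int = 8) -> Tuple[List[int], float]:
    """Map floats to signed integers using one per-tensor scale (assumed scheme)."""
    qmax = 2 ** (num_bits - 1) - 1  # 127 for INT8
    max_abs = max(abs(v) for v in values)
    # Guard against an all-zero tensor, where any scale works.
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round to the nearest integer and clamp to the symmetric range [-qmax, qmax].
    quantized = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return quantized, scale


def int8_dot(a: List[float], b: List[float]) -> float:
    """Approximate the FP32 dot product of a and b with integer arithmetic."""
    qa, scale_a = quantize_symmetric(a)
    qb, scale_b = quantize_symmetric(b)
    # Integer multiply-accumulate; hardware would use a wider (e.g. 32-bit) accumulator.
    acc = sum(x * y for x, y in zip(qa, qb))
    # Dequantize: the product of the two scales maps the result back to real values.
    return acc * scale_a * scale_b


def fp32_dot(a: List[float], b: List[float]) -> float:
    """Reference floating-point dot product."""
    return sum(x * y for x, y in zip(a, b))


if __name__ == "__main__":
    a = [0.5, -1.0, 0.25, 0.75]
    b = [1.0, 0.5, -0.5, 2.0]
    print("fp32:", fp32_dot(a, b))
    print("int8:", int8_dot(a, b))
```

The two printed values should be close but not identical; that gap is the quantization error that an accuracy budget such as the paper's 0.5% has to absorb.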

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Advanced Neural Network Applications

Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Residual Connection · Byte Pair Encoding · Dense Connections · Label Smoothing · Adam · Softmax