Q-BERT: Hessian Based Ultra Low Precision Quantization of BERT

Sheng Shen; Zhen Dong; Jiayu Ye; Linjian Ma; Zhewei Yao; Amir Gholami; Michael W. Mahoney; Kurt Keutzer

arXiv:1909.05840·cs.CL·April 21, 2021·52 cites

Open Access

TL;DR

This paper introduces a Hessian-based ultra low precision quantization method for BERT, significantly reducing model size with minimal performance loss across multiple NLP tasks.

Contribution

It proposes a novel group-wise quantization scheme combined with Hessian-based mix-precision, enabling efficient compression of BERT models for resource-constrained deployment.
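As a rough illustration of what group-wise quantization means, the sketch below splits the rows of a weight matrix into groups and gives each group its own quantization range. It is a minimal sketch under stated assumptions, not the paper's implementation: the function name `quantize_groupwise`, the row-wise grouping, and the min/max uniform quantizer are all hypothetical choices made for this example.

```python
import numpy as np

def quantize_groupwise(weights, num_groups, num_bits):
    """Uniformly quantize each row-group of a 2-D matrix with its own range.

    Hypothetical helper for illustration only; the grouping (contiguous rows)
    and the min/max quantizer are assumptions, not taken from the paper.
    """
    levels = 2 ** num_bits - 1  # number of steps between the group min and max
    out = np.empty_like(weights, dtype=np.float64)
    # Split the row indices into contiguous groups.
    for rows in np.array_split(np.arange(weights.shape[0]), num_groups):
        if rows.size == 0:
            continue
        block = weights[rows]
        lo, hi = block.min(), block.max()
        scale = (hi - lo) / levels if hi > lo else 1.0
        # Round to the nearest level, then map back to real values.
        codes = np.round((block - lo) / scale)
        out[rows] = codes * scale + lo
    return out

# Two rows with very different ranges.
w = np.array([[0.0, 0.1, 0.2, 0.3],
              [0.0, 10.0, 20.0, 30.0]])
one_group = quantize_groupwise(w, num_groups=1, num_bits=2)
two_groups = quantize_groupwise(w, num_groups=2, num_bits=2)
```

In this toy case a single shared range flattens the small-valued row to zeros, while one range per row reproduces both rows exactly. That is the intuition behind using several groups at a fixed bit width.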

Findings

01. Achieves up to 13x compression with at most 2.3% performance degradation.

02. Effective on multiple NLP tasks, including SST-2, MNLI, CoNLL-03, and SQuAD.

03. The highest performance loss is observed on SQuAD, attributed to training convergence issues.

Abstract

Transformer based architectures have become de-facto models used for a range of Natural Language Processing tasks. In particular, the BERT based models achieved significant accuracy gain for GLUE tasks, CoNLL-03 and SQuAD. However, BERT based models have a prohibitive memory footprint and latency. As a result, deploying BERT based models in resource constrained environments has become a challenging task. In this work, we perform an extensive analysis of fine-tuned BERT models using second order Hessian information, and we use our results to propose a novel method for quantizing BERT models to ultra low precision. In particular, we propose a new group-wise quantization scheme, and we use a Hessian based mix-precision method to compress the model further. We extensively test our proposed method on BERT downstream tasks of SST-2, MNLI, CoNLL-03, and SQuAD. We can achieve comparable…
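One way to picture Hessian-guided mixed precision is to score each layer with a curvature estimate and keep more bits where the score is high. The abstract says only that second-order Hessian information is used, so everything below is an assumption for illustration: the largest eigenvalue as the score, the names `top_eigenvalue` and `assign_bits`, the ranking rule, and the 8-bit and 2-bit defaults. Small explicit matrices stand in for Hessian-vector products.

```python
import numpy as np

def top_eigenvalue(matvec, dim, iters=100, seed=0):
    """Estimate the dominant eigenvalue of a symmetric operator by power iteration.

    `matvec` is any function computing H @ v; in practice this would be a
    Hessian-vector product, here it is a hypothetical stand-in.
    """
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(dim)
    v /= np.linalg.norm(v)
    eig = 0.0
    for _ in range(iters):
        hv = matvec(v)
        eig = float(v @ hv)  # Rayleigh quotient for the current unit vector
        norm = np.linalg.norm(hv)
        if norm == 0.0:
            break
        v = hv / norm
    return eig

def assign_bits(sensitivities, num_high, high_bits=8, low_bits=2):
    """Give `high_bits` to the `num_high` most sensitive layers (assumed rule)."""
    order = np.argsort(sensitivities)[::-1]  # most sensitive first
    bits = np.full(len(sensitivities), low_bits)
    bits[order[:num_high]] = high_bits
    return bits.tolist()

# Toy "Hessians" for three layers (illustrative values only).
hessians = [np.diag([0.1, 0.05]), np.diag([5.0, 2.0]), np.diag([2.0, 1.0])]
scores = [top_eigenvalue(lambda v, H=H: H @ v, H.shape[0]) for H in hessians]
bits = assign_bits(scores, num_high=1)
```

Here the middle layer has the largest curvature score and is the only one kept at the higher bit width.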

Taxonomy

Topics: Topic Modeling · Advanced Neural Network Applications · Adversarial Robustness in Machine Learning

Methods: Linear Layer · Residual Connection · Attention Dropout · Linear Warmup With Linear Decay · Weight Decay · Dense Connections · Adam · WordPiece · Softmax