Q-BERT: Hessian Based Ultra Low Precision Quantization of BERT
Sheng Shen, Zhen Dong, Jiayu Ye, Linjian Ma, Zhewei Yao, Amir Gholami, Michael W. Mahoney, and Kurt Keutzer

TL;DR
This paper introduces a Hessian-based ultra low precision quantization method for BERT, significantly reducing model size with minimal performance loss across multiple NLP tasks.
Contribution
The paper proposes a novel group-wise quantization scheme, combined with Hessian-based mixed-precision quantization, to compress BERT models for deployment in resource-constrained environments.
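To make the idea of group-wise quantization concrete, here is a minimal sketch rather than the authors' implementation. It assumes a flat list of weights split into equal-sized groups, with each group uniformly quantized using its own min/max range. The names `quantize_group`, `groupwise_quantize`, and `mean_abs_error`, the group size, and the toy weights are all hypothetical.

```python
from typing import List, Tuple


def quantize_group(values: List[float], bits: int) -> Tuple[List[int], float, float]:
    """Uniformly quantize one group to `bits` bits using the group's own range."""
    lo, hi = min(values), max(values)
    levels = (1 << bits) - 1  # number of steps between lo and hi
    # Guard against a constant group, where the range would be zero.
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [round((v - lo) / scale) for v in values]
    return codes, scale, lo


def dequantize_group(codes: List[int], scale: float, lo: float) -> List[float]:
    """Map integer codes back to approximate real values."""
    return [lo + c * scale for c in codes]


def groupwise_quantize(weights: List[float], group_size: int, bits: int) -> List[float]:
    """Quantize and dequantize `weights`, one range per group (assumed layout)."""
    out: List[float] = []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        codes, scale, lo = quantize_group(group, bits)
        out.extend(dequantize_group(codes, scale, lo))
    return out


def mean_abs_error(a: List[float], b: List[float]) -> float:
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


# Toy weights: four small values followed by four large ones.
weights = [0.01, -0.02, 0.03, -0.01, 1.0, -1.2, 0.8, -0.9]

single_range = groupwise_quantize(weights, group_size=8, bits=2)  # one shared range
two_groups = groupwise_quantize(weights, group_size=4, bits=2)    # one range per group

mae_single = mean_abs_error(weights, single_range)
mae_grouped = mean_abs_error(weights, two_groups)
print(f"one range: {mae_single:.4f}, two groups: {mae_grouped:.4f}")
```

With a single shared range, the small weights are rounded onto a grid sized for the large ones. Giving each group its own range reduces the average error at the cost of storing one scale and offset per group.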
Findings
Achieves up to 13x compression of the model parameters with at most 2.3% performance degradation.
Effective on multiple NLP tasks including SST-2, MNLI, CoNLL-03, and SQuAD.
The highest performance loss is observed on SQuAD, which the authors attribute to fine-tuning not converging on that task.
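The compression figure can be related to bit-widths with some simple arithmetic. The sketch below assumes a 32-bit floating-point baseline and ignores the overhead of storing per-group quantization ranges. The function names and the layer sizes in the example are hypothetical and are not taken from the paper.

```python
from typing import List, Tuple


def compression_ratio(layers: List[Tuple[int, int]], baseline_bits: int = 32) -> float:
    """Ratio of baseline size to quantized size.

    `layers` is a list of (num_parameters, bits_per_parameter) pairs.
    """
    baseline = sum(n for n, _ in layers) * baseline_bits
    quantized = sum(n * b for n, b in layers)
    return baseline / quantized


def average_bits(ratio: float, baseline_bits: int = 32) -> float:
    """Average bits per parameter implied by a given compression ratio."""
    return baseline_bits / ratio


# A 13x ratio over an assumed 32-bit baseline implies about 2.46 bits per parameter.
print(round(average_bits(13), 2))

# Hypothetical mixed-precision model: half the parameters at 2 bits, half at 3 bits.
print(compression_ratio([(1000, 2), (1000, 3)]))  # 12.8
```

Under these assumptions, a ratio near 13x is consistent with a mix of very low bit-widths rather than a single uniform setting.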
Abstract
Transformer based architectures have become de-facto models used for a range of Natural Language Processing tasks. In particular, the BERT based models achieved significant accuracy gain for GLUE tasks, CoNLL-03 and SQuAD. However, BERT based models have a prohibitive memory footprint and latency. As a result, deploying BERT based models in resource constrained environments has become a challenging task. In this work, we perform an extensive analysis of fine-tuned BERT models using second order Hessian information, and we use our results to propose a novel method for quantizing BERT models to ultra low precision. In particular, we propose a new group-wise quantization scheme, and we use a Hessian based mix-precision method to compress the model further. We extensively test our proposed method on BERT downstream tasks of SST-2, MNLI, CoNLL-03, and SQuAD. We can achieve comparable…
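The truncated abstract above does not say which second-order quantity is computed. One common way to summarize Hessian information without forming the full matrix is to estimate its largest eigenvalue by power iteration on Hessian-vector products. The sketch below illustrates that general technique on a toy quadratic loss; the loss, the finite-difference Hessian-vector product, and all function names are assumptions for illustration, not the authors' code.

```python
import math
from typing import Callable, List

Vector = List[float]


def toy_grad(w: Vector) -> Vector:
    """Gradient of the toy loss 0.5 * (4*w0^2 + w1^2); its Hessian is diag(4, 1)."""
    return [4.0 * w[0], 1.0 * w[1]]


def hessian_vector_product(grad_fn: Callable[[Vector], Vector], w: Vector,
                           v: Vector, eps: float = 1e-4) -> Vector:
    """Central finite difference: H v ~ (g(w + eps*v) - g(w - eps*v)) / (2*eps)."""
    g_plus = grad_fn([wi + eps * vi for wi, vi in zip(w, v)])
    g_minus = grad_fn([wi - eps * vi for wi, vi in zip(w, v)])
    return [(a - b) / (2.0 * eps) for a, b in zip(g_plus, g_minus)]


def top_eigenvalue(grad_fn: Callable[[Vector], Vector], w: Vector,
                   iters: int = 50) -> float:
    """Estimate the largest-magnitude Hessian eigenvalue at `w` by power iteration."""
    v = [1.0 / math.sqrt(len(w))] * len(w)  # unit-norm starting vector
    eig = 0.0
    for _ in range(iters):
        hv = hessian_vector_product(grad_fn, w, v)
        eig = sum(a * b for a, b in zip(hv, v))  # Rayleigh quotient for unit v
        norm = math.sqrt(sum(x * x for x in hv))
        if norm == 0.0:  # zero curvature along v; nothing more to refine
            break
        v = [x / norm for x in hv]
    return eig


estimate = top_eigenvalue(toy_grad, w=[0.3, -0.2])
print(f"estimated top eigenvalue: {estimate:.4f}")
```

In Hessian-aware mixed-precision schemes generally, a per-layer sensitivity estimate of this kind is used to decide which layers keep more bits. How Q-BERT defines and uses its sensitivity measure is described in the full paper, not in this summary.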
Taxonomy
Topics: Topic Modeling · Advanced Neural Network Applications · Adversarial Robustness in Machine Learning
Methods: Linear Layer · Residual Connection · Attention Dropout · Linear Warmup With Linear Decay · Weight Decay · Dense Connections · Adam · WordPiece · Softmax
