Empirical Evaluation of Post-Training Quantization Methods for Language Tasks

Ting Hu; Christoph Meinel; Haojin Yang

arXiv:2210.16621·cs.CL·November 1, 2022

TL;DR

This paper empirically evaluates post-training quantization methods for BERT models, demonstrating that low-bit quantization can maintain high performance and facilitate deployment in resource-limited settings.

Contribution

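The study compares three PTQ methods on BERT-Base and BERT-Large: Linear Quantization (LQ), Analytical Clipping for Integer Quantization (ACIQ), and Outlier Channel Splitting (OCS). It highlights OCS's superior performance and explores how few quantization bits still allow effective model compression.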

Findings

01 OCS outperforms other PTQ methods in minimizing quantization error.

02 Low-bit quantized BERT models can outperform 32-bit baselines on some tasks.

03 BERT models can be quantized to 3 bits with minimal performance loss.

Abstract

Transformer-based architectures like BERT have achieved great success in a wide range of natural language tasks. Despite their decent performance, the models still have numerous parameters and high computational complexity, impeding their deployment in resource-constrained environments. Post-Training Quantization (PTQ), which enables low-bit computations without extra training, could be a promising tool. In this work, we conduct an empirical evaluation of three PTQ methods on BERT-Base and BERT-Large: Linear Quantization (LQ), Analytical Clipping for Integer Quantization (ACIQ), and Outlier Channel Splitting (OCS). OCS theoretically surpasses the others in minimizing the mean squared quantization error and in avoiding distortion of the weights' outliers. That is consistent with the evaluation results on most language tasks of the GLUE benchmark and on a reading comprehension task, SQuAD. Moreover,…
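
To make the abstract's claim about outliers and mean squared quantization error concrete, the following is a minimal sketch, not the authors' implementation. It assumes symmetric uniform quantization for LQ and a heavily simplified form of OCS in which the largest-magnitude weight is repeatedly halved into two copies before quantization and the copies are summed again afterwards. All function names and the toy weight vector are hypothetical, and ACIQ (which clips the range instead of splitting) is not shown.

```python
def linear_quantize(weights, bits):
    """Symmetric uniform quantization, returned in dequantized (float) form."""
    levels = 2 ** (bits - 1) - 1              # largest signed integer level
    max_abs = max(abs(w) for w in weights)
    if max_abs == 0:
        return list(weights)
    scale = max_abs / levels                  # step size is set by the largest weight
    return [max(-levels, min(levels, round(w / scale))) * scale for w in weights]


def mse(original, approximated):
    """Mean squared error between two equally long lists."""
    return sum((a - b) ** 2 for a, b in zip(original, approximated)) / len(original)


def split_outliers(weights, num_splits):
    """Simplified OCS step: halve the largest-magnitude entry into two copies.

    Returns the expanded list and, for each entry, the index of the
    original weight it came from.
    """
    expanded = list(weights)
    owner = list(range(len(weights)))
    for _ in range(num_splits):
        i = max(range(len(expanded)), key=lambda j: abs(expanded[j]))
        expanded[i] /= 2
        expanded.append(expanded[i])
        owner.append(owner[i])
    return expanded, owner


def ocs_quantize(weights, bits, num_splits):
    """Split outliers, quantize the expanded list, then sum the copies back."""
    expanded, owner = split_outliers(weights, num_splits)
    dequantized = linear_quantize(expanded, bits)
    merged = [0.0] * len(weights)
    for value, idx in zip(dequantized, owner):
        merged[idx] += value
    return merged


# Toy weight vector (assumed for illustration): small values plus one outlier.
weights = [0.1, -0.2, 0.15, -0.05, 0.3, 2.0]
lq_error = mse(weights, linear_quantize(weights, bits=4))
ocs_error = mse(weights, ocs_quantize(weights, bits=4, num_splits=1))
print(f"LQ MSE:  {lq_error:.5f}")
print(f"OCS MSE: {ocs_error:.5f}")
```

In this toy case, halving the outlier shrinks the quantization range, so the small weights get a finer step while the outlier itself is still represented without clipping. In a real layer, splitting a weight channel also requires duplicating the matching input channel so that the layer's output is unchanged; the summation in `ocs_quantize` stands in for that here.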

Taxonomy

Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Topic Modeling

Methods: Multi-Head Attention · Attention Is All You Need · Linear Warmup With Linear Decay · Attention Dropout · Weight Decay · Dense Connections · Linear Layer · Layer Normalization · Residual Connection · Dropout