TL;DR
This paper introduces Q8BERT, a method for quantization-aware training of BERT during fine-tuning, reducing model size by 4x with minimal accuracy loss and enabling faster inference on 8-bit hardware.
Contribution
It presents a quantization-aware training approach applied during BERT fine-tuning, achieving 4x compression with minimal accuracy loss and enabling faster inference on hardware that supports 8-bit integer arithmetic.
Findings
BERT can be compressed 4x with minimal accuracy loss.
Quantized BERT accelerates inference on 8-bit hardware.
The method maintains high performance after quantization.
Abstract
Recently, pre-trained Transformer-based language models such as BERT and GPT have shown great improvement in many Natural Language Processing (NLP) tasks. However, these models contain a large number of parameters. The emergence of even larger and more accurate models such as GPT2 and Megatron suggests a trend toward large pre-trained Transformer models. However, using these large models in production environments is a complex task requiring a large amount of compute, memory and power resources. In this work we show how to perform quantization-aware training during the fine-tuning phase of BERT in order to compress BERT by 4x with minimal accuracy loss. Furthermore, the produced quantized model can accelerate inference if it is optimized for hardware that supports 8-bit integer arithmetic.
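To make the idea of quantization-aware training more concrete, here is a minimal sketch of the "fake quantization" step such training typically relies on: values are rounded to 8-bit integers and immediately mapped back to floats, so the model sees the rounding error while it is being fine-tuned. This is an illustration only, not the paper's implementation. It assumes symmetric linear quantization with a single scale per tensor, and the function name `fake_quantize` and the sample weights are hypothetical. Storing 8-bit integers instead of 32-bit floats is where the 4x size reduction comes from.

```python
def fake_quantize(values, num_bits=8):
    """Round floats to signed integers, then map them back to floats.

    Assumption: symmetric linear quantization with one scale per tensor.
    Returns (dequantized floats, integer codes, scale).
    """
    qmax = 2 ** (num_bits - 1) - 1  # 127 for 8 bits
    max_abs = max(abs(v) for v in values)
    if max_abs == 0.0:
        # All-zero tensor: nothing to scale.
        return list(values), [0] * len(values), 1.0
    scale = qmax / max_abs
    # Quantize: scale, round to nearest integer, clamp to [-qmax, qmax].
    ints = [max(-qmax, min(qmax, round(v * scale))) for v in values]
    # Dequantize: the forward pass continues with these approximations.
    dequantized = [q / scale for q in ints]
    return dequantized, ints, scale


# Hypothetical weights, for illustration only.
weights = [0.5, -1.27, 0.003, 1.0]
dequantized, ints, scale = fake_quantize(weights)

# Worst-case rounding error is half a quantization step.
max_error = max(abs(d - w) for d, w in zip(dequantized, weights))
size_ratio = 32 // 8  # 32-bit floats replaced by 8-bit integers
```

In an actual training loop, rounding has no useful gradient, so the backward pass usually treats this step as the identity (a straight-through estimator); that part is omitted here.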
Taxonomy
Methods: Linear Layer · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Cosine Annealing · Discriminative Fine-Tuning · Linear Warmup With Cosine Annealing · GPT · Residual Connection · Attention Dropout
