Q8BERT: Quantized 8Bit BERT

Ofir Zafrir; Guy Boudoukh; Peter Izsak; Moshe Wasserblat

arXiv:1910.06188 · cs.CL · December 20, 2021

TL;DR

This paper introduces Q8BERT, a method for quantization-aware training of BERT during fine-tuning, reducing model size by 4x with minimal accuracy loss and enabling faster inference on 8-bit hardware.

Contribution

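The paper presents a quantization-aware training approach applied during BERT fine-tuning. It compresses the model by 4x with minimal accuracy loss, and the quantized model can also speed up inference when run on hardware that supports 8-bit integer operations.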

Findings

01. BERT can be compressed 4x with minimal accuracy loss.

02. Quantized BERT accelerates inference on 8-bit hardware.

03. The method maintains high performance after quantization.
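The 4x figure is consistent with storing each weight as an 8-bit integer instead of a 32-bit float. The following is a back-of-the-envelope sketch only: the 32-bit baseline and the parameter count of roughly 110 million (about the size of BERT-Base) are assumptions not stated on this page, and the helper `model_size_mb` is a hypothetical name introduced for illustration. It ignores per-tensor scale factors and any parts of the model that might stay in floating point.

```python
def model_size_mb(num_params: int, bits_per_param: int) -> float:
    """Storage for the weights alone, in megabytes (10**6 bytes)."""
    return num_params * bits_per_param / 8 / 1e6


# Assumed parameter count, roughly BERT-Base; not taken from this page.
NUM_PARAMS = 110_000_000

fp32_mb = model_size_mb(NUM_PARAMS, 32)  # assumed 32-bit float baseline
int8_mb = model_size_mb(NUM_PARAMS, 8)   # 8-bit integer weights
ratio = fp32_mb / int8_mb

print(f"FP32: {fp32_mb:.0f} MB, INT8: {int8_mb:.0f} MB, ratio: {ratio:.1f}x")
# prints "FP32: 440 MB, INT8: 110 MB, ratio: 4.0x"
```

The ratio depends only on the bit widths (32 / 8), so it holds for any parameter count under these assumptions.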

Abstract

Recently, pre-trained Transformer-based language models such as BERT and GPT have shown great improvement in many Natural Language Processing (NLP) tasks. However, these models contain a large number of parameters. The emergence of even larger and more accurate models such as GPT2 and Megatron suggests a trend toward large pre-trained Transformer models. However, using these large models in production environments is a complex task requiring a large amount of compute, memory, and power resources. In this work we show how to perform quantization-aware training during the fine-tuning phase of BERT in order to compress BERT by $4 \times$ with minimal accuracy loss. Furthermore, the resulting quantized model can accelerate inference if it is optimized for hardware that supports 8-bit integer operations.
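To make "quantization-aware training" concrete: a common way to do it is to insert "fake quantization" into the forward pass, so the model is fine-tuned while seeing the rounding error it will meet at inference time. The sketch below uses symmetric linear quantization with a max-absolute-value scale and a straight-through estimator for the gradient. The abstract does not specify the quantizer, so this scheme is an assumption, and the names `fake_quantize` and `fake_quantize_grad` are hypothetical, chosen here for illustration.

```python
import numpy as np


def fake_quantize(x: np.ndarray, num_bits: int = 8) -> np.ndarray:
    """Round x onto a signed integer grid, then map it back to floats.

    Assumed scheme: symmetric linear quantization, with the scale chosen
    so that the largest |x| maps to the largest representable integer.
    """
    qmax = 2 ** (num_bits - 1) - 1           # 127 for 8 bits
    max_abs = float(np.max(np.abs(x)))
    if max_abs == 0.0:
        return x.copy()                      # all zeros: nothing to quantize
    scale = qmax / max_abs
    q = np.clip(np.round(x * scale), -qmax, qmax)  # integer-valued levels
    return q / scale                         # de-quantize back to floats


def fake_quantize_grad(upstream_grad: np.ndarray) -> np.ndarray:
    """Straight-through estimator: treat rounding as the identity function,
    so the gradient passes through the quantizer unchanged."""
    return upstream_grad


weights = np.array([-0.50, -0.1234, 0.0, 0.3, 1.0], dtype=np.float32)
w_q = fake_quantize(weights)
max_err = float(np.max(np.abs(weights - w_q)))

print("quantized weights:", w_q)
print("max rounding error:", max_err)
```

With this scale, the rounding error per value is at most half a quantization step, that is `max_abs / (2 * qmax)`, which is about 0.004 for the example above. During fine-tuning, the float weights would still be the ones updated by the optimizer; only the forward pass uses the quantized values.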

Taxonomy

Methods: Linear Layer · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Cosine Annealing · Discriminative Fine-Tuning · Linear Warmup With Cosine Annealing · GPT · Residual Connection · Attention Dropout