FP8-BERT: Post-Training Quantization for Transformer

Jianwei Li; Tianchi Zhang; Ian En-Hsu Yen; Dongkuan Xu

arXiv:2312.05725·cs.AI·December 13, 2023·2 cites

Open Access

TL;DR

This paper shows that FP8-based post-training quantization can reduce BERT's model size and inference cost without significant accuracy loss, avoiding the accuracy degradation that INT8 post-training quantization typically suffers.

Contribution

The paper empirically validates FP8 as an effective format for post-training quantization of BERT, achieving accuracy comparable to full-precision models.

Findings

01 FP8 PTQ keeps BERT's accuracy close to that of the full-precision model.

02 FP8 retains accuracy better than INT8 under post-training quantization.

03 A simple calibration and format-conversion step is enough for effective FP8 quantization.

Abstract

Transformer-based models, such as BERT, have been applied to a wide range of natural language processing tasks. However, one inevitable side effect is that they require massive memory storage and incur high inference cost when deployed in production. Quantization is one of the popular ways to alleviate this cost. However, the previous 8-bit quantization strategy based on the INT8 data format either suffers from accuracy degradation when applied as Post-Training Quantization (PTQ) or requires an expensive Quantization-Aware Training (QAT) process. Recently, a new numeric format, FP8 (i.e., 8-bit floating point), has been proposed and is supported on commercial AI computing platforms such as the NVIDIA H100. In this paper, we empirically validate the effectiveness of FP8 as a way to do Post-Training Quantization without significant loss of accuracy, with a simple calibration and format conversion…
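
The abstract mentions "a simple calibration and format conversion" but this page does not give the details, so the following is only an illustrative sketch and not the authors' procedure. It assumes the E4M3 variant of FP8 (4 exponent bits, 3 mantissa bits, largest finite value 448) and a per-tensor scale taken from the maximum absolute value seen during calibration. The helper names `round_to_e4m3`, `calibrate_scale`, and `fake_quant_fp8` are hypothetical, and the FP8 rounding is simulated in NumPy rather than run on FP8 hardware.

```python
import numpy as np

# Assumed format: FP8 E4M3 (4 exponent bits, 3 mantissa bits, bias 7).
E4M3_MAX = 448.0       # largest finite E4M3 magnitude
E4M3_MIN_EXP = -6      # exponent of the smallest normal number
MANTISSA_BITS = 3

def round_to_e4m3(x):
    """Round values to the nearest E4M3-representable number (simulated)."""
    x = np.clip(np.asarray(x, dtype=np.float64), -E4M3_MAX, E4M3_MAX)
    # frexp gives |x| = m * 2**e with m in [0.5, 1), so floor(log2|x|) = e - 1.
    _, e = np.frexp(np.abs(x))
    # Below the smallest normal exponent, subnormals share one fixed spacing.
    exp = np.maximum(e - 1, E4M3_MIN_EXP)
    step = np.ldexp(1.0, exp - MANTISSA_BITS)  # grid spacing in this binade
    return np.round(x / step) * step

def calibrate_scale(samples):
    """Per-tensor scale mapping the largest calibration magnitude to E4M3_MAX."""
    amax = max(float(np.max(np.abs(s))) for s in samples)
    return amax / E4M3_MAX if amax > 0 else 1.0

def fake_quant_fp8(x, scale):
    """Scale into FP8 range, round to the E4M3 grid, then scale back."""
    return round_to_e4m3(np.asarray(x) / scale) * scale

# Usage: calibrate on a toy tensor, then quantize and dequantize it.
weights = np.array([0.5, -0.25, 0.1, 2.0])
scale = calibrate_scale([weights])
dequantized = fake_quant_fp8(weights, scale)
```

With 3 mantissa bits, values that land in the normal range after scaling are reproduced with a relative error of at most 2^-4, which is the kind of property that lets a floating-point format tolerate wide dynamic ranges without retraining.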

Taxonomy

Topics: Advanced Neural Network Applications · Machine Learning and Data Classification · Neural Networks and Applications

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Dense Connections · Attention Dropout · Linear Warmup With Linear Decay · WordPiece · Adam · Weight Decay