FP8-BERT: Post-Training Quantization for Transformer
Jianwei Li, Tianchi Zhang, Ian En-Hsu Yen, Dongkuan Xu

TL;DR
This paper shows empirically that FP8-based post-training quantization can reduce BERT's memory footprint and inference cost without significant loss of accuracy, avoiding both the accuracy degradation of INT8 post-training quantization and the expense of quantization-aware training.
Contribution
The paper empirically validates FP8 as an effective format for post-training quantization of BERT, achieving accuracy comparable to full-precision models.
Findings
FP8 PTQ keeps BERT's accuracy close to that of the full-precision model.
FP8 PTQ retains accuracy better than INT8 PTQ.
Simple calibration enables effective FP8 quantization.
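
As a rough illustration of what "simple calibration and format conversion" can look like, the sketch below simulates FP8 quantization in NumPy. It is not the authors' implementation: the choice of the E4M3 variant (maximum finite value 448, no infinities), the per-tensor max-abs calibration rule, and the helper names round_to_e4m3, calibrate_scale, and fake_quantize are all assumptions made for this example.

```python
import numpy as np

# Assumed format: FP8 E4M3 (1 sign, 4 exponent, 3 mantissa bits, bias 7).
E4M3_MAX = 448.0      # largest finite magnitude
E4M3_MIN_EXP = -6     # exponent of the smallest normal number
MANTISSA_BITS = 3


def round_to_e4m3(x):
    """Round values to the nearest number representable in E4M3 (saturating)."""
    x = np.asarray(x, dtype=np.float64)
    clipped = np.clip(x, -E4M3_MAX, E4M3_MAX)
    mag = np.abs(clipped)
    # frexp gives mag = m * 2**e with m in [0.5, 1), so floor(log2(mag)) = e - 1.
    _, e = np.frexp(mag)
    # Values below the smallest normal share the subnormal spacing 2**-9.
    exp = np.maximum(e - 1, E4M3_MIN_EXP)
    step = np.ldexp(1.0, exp - MANTISSA_BITS)
    # np.round rounds halves to even, matching round-to-nearest-even.
    return np.sign(clipped) * np.round(mag / step) * step


def calibrate_scale(samples):
    """Hypothetical max-abs calibration: map the largest observed value to E4M3_MAX."""
    amax = max(float(np.max(np.abs(s))) for s in samples)
    return E4M3_MAX / amax if amax > 0 else 1.0


def fake_quantize(x, scale):
    """Scale, convert to simulated FP8, and scale back to the original range."""
    return round_to_e4m3(np.asarray(x, dtype=np.float64) * scale) / scale


# Example: quantize a random weight matrix and report the worst-case error.
rng = np.random.default_rng(0)
weights = rng.normal(0.0, 0.05, size=(64, 64))
scale = calibrate_scale([weights])
weights_q = fake_quantize(weights, scale)
print("max abs error:", np.abs(weights_q - weights).max())
```

Because FP8 spacing grows with magnitude, the rounding error for in-range normal values is bounded relative to each value rather than fixed across the tensor, which is one commonly cited reason floating-point formats tolerate outliers better than INT8's uniform grid.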
Abstract
Transformer-based models, such as BERT, have been applied to a wide range of natural language processing tasks. However, one inevitable side effect is that they require massive memory storage and inference cost when deployed in production. Quantization is one of the popular ways to alleviate the cost. However, the previous 8-bit quantization strategy based on the INT8 data format either suffers from accuracy degradation in a Post-Training Quantization (PTQ) fashion or requires an expensive Quantization-Aware Training (QAT) process. Recently, a new numeric format, FP8 (i.e., 8-bit floating point), has been proposed and supported in commercial AI computing platforms such as H100. In this paper, we empirically validate the effectiveness of FP8 as a way to do Post-Training Quantization without significant loss of accuracy, with a simple calibration and format conversion…
Taxonomy
Topics: Advanced Neural Network Applications · Machine Learning and Data Classification · Neural Networks and Applications
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Dense Connections · Attention Dropout · Linear Warmup With Linear Decay · WordPiece · Adam · Weight Decay
