Zero-Shot Dynamic Quantization for Transformer Inference
Yousef El-Kurdi, Jerry Quinn, Avirup Sil

TL;DR
This paper presents a run-time zero-shot quantization method for BERT-like models that reduces accuracy loss during 8-bit integer quantization without additional training or calibration, enabling efficient NLP inference.
Contribution
The proposed method allows quantization of transformer models at run-time without training modifications or calibration, simplifying deployment.
Findings
Effective on multiple NLP tasks
Reduces accuracy loss compared to existing methods
No additional calibration needed
Abstract
We introduce a novel run-time method for significantly reducing the accuracy loss associated with quantizing BERT-like models to 8-bit integers. Existing methods for quantizing models either modify the training procedure or require an additional calibration step to adjust parameters, which in turn requires a selected held-out dataset. Our method permits taking advantage of quantization without the need for these adjustments. We present results on several NLP tasks demonstrating the usefulness of this technique.
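For readers unfamiliar with the setting, the following is a minimal sketch of generic dynamic (run-time) symmetric 8-bit quantization, in which the scale is derived from the tensor being processed rather than from a calibration set. It is not the authors' method, whose details are not given in this summary; the function names `quantize_dynamic`, `dequantize`, and `int8_dot` are hypothetical and chosen only for illustration.

```python
from typing import List, Tuple


def quantize_dynamic(values: List[float]) -> Tuple[List[int], float]:
    """Symmetric per-tensor quantization to the int8 range [-127, 127].

    The scale is computed from the values seen at run time, so no
    calibration dataset or retraining is involved (illustrative only).
    """
    max_abs = max((abs(v) for v in values), default=0.0)
    # Guard against an all-zero (or empty) tensor to avoid dividing by zero.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized: List[int], scale: float) -> List[float]:
    """Map int8 codes back to approximate floating-point values."""
    return [q * scale for q in quantized]


def int8_dot(a: List[float], b: List[float]) -> float:
    """Dot product computed on integer codes, rescaled once at the end."""
    qa, scale_a = quantize_dynamic(a)
    qb, scale_b = quantize_dynamic(b)
    acc = sum(x * y for x, y in zip(qa, qb))  # integer accumulation
    return acc * scale_a * scale_b
```

In this sketch the rounding error of each value is bounded by half the scale, so a single large outlier in a tensor coarsens the representation of every other value. That is one common source of the accuracy loss that quantization methods try to reduce.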
Taxonomy
Topics: Neural Networks and Reservoir Computing · Anomaly Detection Techniques and Applications · Model Reduction and Neural Networks
