Zero-Shot Dynamic Quantization for Transformer Inference

Yousef El-Kurdi; Jerry Quinn; Avirup Sil

arXiv:2211.09744·cs.CL·November 18, 2022

TL;DR

This paper presents a run-time zero-shot quantization method for BERT-like models that reduces accuracy loss during 8-bit integer quantization without additional training or calibration, enabling efficient NLP inference.

Contribution

The proposed method allows quantization of transformer models at run-time without training modifications or calibration, simplifying deployment.

Findings

1. Effective on multiple NLP tasks
2. Reduces accuracy loss compared to existing methods
3. No additional calibration needed

Abstract

We introduce a novel run-time method for significantly reducing the accuracy loss associated with quantizing BERT-like models to 8-bit integers. Existing methods for quantizing models either modify the training procedure or require an additional calibration step to adjust parameters, which in turn requires a selected held-out dataset. Our method permits taking advantage of quantization without the need for these adjustments. We present results on several NLP tasks demonstrating the usefulness of this technique.
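
For readers unfamiliar with the term, the following is a minimal sketch of generic run-time (dynamic) 8-bit quantization, where the scale is derived from the values seen at inference time rather than from a calibration set. It is not the method proposed in the paper, whose details are not given on this page. The function names `quantize_dynamic` and `quantized_dot`, the symmetric per-tensor scheme, and the sample values are all assumptions chosen for illustration.

```python
def quantize_dynamic(values, num_bits=8):
    """Symmetric per-tensor quantization with a scale computed at run time.

    Illustrative assumption: the scale comes from the largest absolute
    value in the input itself, so no calibration data is needed.
    """
    qmax = 2 ** (num_bits - 1) - 1  # 127 for 8 bits
    max_abs = max((abs(v) for v in values), default=0.0)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round to the nearest integer level and clamp to [-qmax, qmax].
    quantized = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return quantized, scale


def quantized_dot(x, w):
    """Dot product with integer accumulation, rescaled back to float."""
    qx, sx = quantize_dynamic(x)
    qw, sw = quantize_dynamic(w)
    acc = sum(a * b for a, b in zip(qx, qw))  # integer accumulator
    return acc * sx * sw


# Hypothetical activations and weights, used only to compare the two results.
x = [0.5, -1.25, 2.0, 0.0]
w = [0.1, 0.4, -0.3, 0.9]
exact = sum(a * b for a, b in zip(x, w))
approx = quantized_dot(x, w)
print(f"float32-style result: {exact:.4f}, int8-style result: {approx:.4f}")
```

The gap between `exact` and `approx` is the kind of quantization error the abstract refers to. In this simple scheme a single large outlier inflates the scale and costs precision for every other value.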

Taxonomy

Topics: Neural Networks and Reservoir Computing · Anomaly Detection Techniques and Applications · Model Reduction and Neural Networks