Quadapter: Adapter for GPT-2 Quantization
Minseop Park, Jaeseong You, Markus Nagel, Simyung Chang

TL;DR
This paper introduces Quadapter, a small learnable module that improves GPT-2 quantization by preventing overfitting during quantization-aware training, leading to better performance without altering the original model parameters.
Contribution
We propose Quadapter, a quantization adapter that scales activations channel-wise so they are easier to quantize, without overfitting to the fine-tuning data. It applies to pretrained models even when their original training data is not available.
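The channel-wise scaling described above can be written out for a single linear layer to show why it need not change the model's full-precision function. This is a sketch under the assumption that the inverse scale is absorbed by the weights of the layer that consumes the activation; $Q(\cdot)$ denotes a uniform quantizer, and the symbols are illustrative rather than taken from the paper.

```latex
% x \in \mathbb{R}^{1 \times d}: input activation,  W \in \mathbb{R}^{d \times m}: frozen weight
% \alpha \in \mathbb{R}^{d}_{>0}: per-channel scales,  D_\alpha = \operatorname{diag}(\alpha)
y = xW = (x D_\alpha)\,(D_\alpha^{-1} W)
\qquad\Longrightarrow\qquad
\hat{y} = Q(x D_\alpha)\, D_\alpha^{-1} W
```

Without quantization the two factorizations are identical. With $Q$ in place, $\alpha$ can shrink outlier channels and enlarge the rest, so a single quantization grid fits all channels better; only $\alpha$ is learned, and $W$ itself is not updated.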
Findings
Quadapter effectively prevents overfitting during quantization-aware training.
It improves quantization performance on GPT-2.
The method leaves the original model parameters unchanged.
Abstract
Transformer language models such as GPT-2 are difficult to quantize because outliers in their activations lead to large quantization errors. To compensate for these errors, one must use quantization-aware training, which requires fine-tuning with the same dataset and training pipeline used for the original model. Pretrained language models, however, often do not grant access to their datasets and training pipelines, forcing us to rely on arbitrary ones for fine-tuning. In that case, we observe that quantization-aware training overfits the model to the fine-tuning data. To quantize without overfitting, we introduce a quantization adapter (Quadapter), a small set of parameters that are learned to make activations quantization-friendly by scaling them channel-wise. It keeps the model parameters unchanged. By applying our method to the challenging task of…
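The outlier effect described in the abstract can be reproduced with a small numerical sketch. It assumes a symmetric per-tensor 8-bit quantizer and replaces the learned Quadapter scales with a hand-picked rule, $\alpha_j = 1/\max_b |x_{bj}|$; neither choice is claimed to match the paper. The helpers `fake_quant` and `scaled_linear` are hypothetical names used only for illustration.

```python
import numpy as np


def fake_quant(x, num_bits=8):
    """Symmetric per-tensor uniform quantization, returned in dequantized (float) form."""
    qmax = 2 ** (num_bits - 1) - 1
    step = np.max(np.abs(x)) / qmax  # one step size shared by every channel
    if step == 0:
        return x.copy()
    return np.clip(np.round(x / step), -qmax - 1, qmax) * step


def scaled_linear(x, weight, alpha, num_bits=8):
    """Hypothetical adapter-style layer: scale each input channel by alpha, quantize,
    then undo the scaling through the rows of the frozen weight matrix.
    num_bits=None skips quantization (full-precision path)."""
    u = x * alpha  # channel-wise scaling; alpha has shape (d_in,)
    if num_bits is not None:
        u = fake_quant(u, num_bits)
    return u @ (weight / alpha[:, None])  # inverse scale folded into weight rows


rng = np.random.default_rng(0)
x = rng.normal(size=(256, 8))
x[:, 3] *= 50.0  # a single outlier channel dominates the per-tensor range
w = rng.normal(size=(8, 4))  # frozen weight; never modified below
reference = x @ w

no_scale = np.ones(8)
alpha = 1.0 / np.abs(x).max(axis=0)  # stand-in for learned scales (assumption)

# In full precision the scaling is an exact reparameterization (up to float rounding).
fp_out = scaled_linear(x, w, alpha, num_bits=None)

mse_plain = np.mean((scaled_linear(x, w, no_scale) - reference) ** 2)
mse_scaled = np.mean((scaled_linear(x, w, alpha) - reference) ** 2)
print(f"8-bit output MSE without scaling: {mse_plain:.4f}")
print(f"8-bit output MSE with scaling:    {mse_scaled:.4f}")
```

Without scaling, the outlier channel sets the step size, so most values in the other channels round to zero; with per-channel scales, each channel uses the full grid. Both runs share the same `w`, which mirrors the idea that the adapter adds parameters rather than changing the model's weights.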
Taxonomy
Topics: Topic Modeling · Speech Recognition and Synthesis · Machine Learning and Data Classification
Methods: Attention Is All You Need · Byte Pair Encoding · Linear Layer · Weight Decay · Multi-Head Attention · Discriminative Fine-Tuning · Dense Connections · Cosine Annealing · Residual Connection
