PreQuant: A Task-agnostic Quantization Approach for Pre-trained Language Models
Zhuocheng Gong, Jiahao Liu, Qifan Wang, Yang Yang, Jingang Wang, Wei Wu, Yunsen Xian, Dongyan Zhao, Rui Yan

TL;DR
PreQuant introduces a task-agnostic quantization framework for pre-trained language models that enables efficient compression without task-specific training, maintaining performance across multiple NLP benchmarks.
Contribution
It proposes a novel 'quantize before fine-tuning' approach that is compatible with various quantization strategies and incorporates parameter-efficient fine-tuning to mitigate quantization errors.
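The ordering described above (quantize first, then fine-tune only a small set of parameters) can be illustrated with a toy example. This is a minimal sketch under stated assumptions, not PreQuant's actual procedure: the linear model, the 4-bit symmetric quantizer, the rule of keeping the largest-magnitude weight trainable, and all names (`quantize`, `fine_tune`, `pretrained`, `target`) are hypothetical choices made for illustration.

```python
def quantize(values, num_bits=4):
    """Snap each value to a symmetric uniform grid and return the dequantized values."""
    qmax = 2 ** (num_bits - 1) - 1
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in values]


def predict(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))


def mse(w, data):
    return sum((predict(w, x) - y) ** 2 for x, y in data) / len(data)


def fine_tune(w, trainable, data, lr=0.05, steps=200):
    """Gradient descent that updates only the indices in `trainable`; all others stay frozen."""
    w = list(w)
    for _ in range(steps):
        grads = {i: 0.0 for i in trainable}
        for x, y in data:
            err = predict(w, x) - y
            for i in trainable:
                grads[i] += 2 * err * x[i] / len(data)
        for i in trainable:
            w[i] -= lr * grads[i]
    return w


# Hypothetical "pre-trained" weights and a toy downstream task.
pretrained = [0.9, -0.4, 0.25, 0.05]
target = [1.0, -0.5, 0.3, 0.0]
inputs = [
    [1.0, 0.5, -0.2, 0.3],
    [0.2, -1.0, 0.4, 0.1],
    [-0.5, 0.3, 1.0, -0.7],
    [0.7, 0.7, -0.3, 0.9],
]
data = [(x, predict(target, x)) for x in inputs]

# Step 1: quantize before any task-specific training.
quantized = quantize(pretrained, num_bits=4)

# Step 2 (assumed selection rule): keep only the largest-magnitude weight trainable.
k = 1
trainable = sorted(range(len(quantized)), key=lambda i: abs(quantized[i]), reverse=True)[:k]

# Step 3: fine-tune the small trainable subset; frozen weights stay on the quantized grid.
loss_before = mse(quantized, data)
tuned = fine_tune(quantized, trainable, data)
loss_after = mse(tuned, data)
```

In this sketch the quantization step never sees the downstream data, so the same quantized weights could be reused across tasks, with only the small trainable subset differing per task.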
Findings
Effective on GLUE benchmark with BERT, RoBERTa, T5
Outperforms task-specific quantization methods
Reduces model size with minimal performance loss
Abstract
While transformer-based pre-trained language models (PLMs) have dominated a number of NLP applications, these models are heavy to deploy and expensive to use. Therefore, effectively compressing large-scale PLMs becomes an increasingly important problem. Quantization, which represents high-precision tensors with a low-bit fixed-point format, is a viable solution. However, most existing quantization methods are task-specific, requiring customized training and quantization with a large number of trainable parameters on each individual task. Inspired by the observation that the over-parameterized nature of PLMs makes it possible to freeze most of the parameters during the fine-tuning stage, in this work, we propose a novel "quantize before fine-tuning" framework, PreQuant, that differs from both quantization-aware training and post-training quantization. PreQuant is compatible with various…
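The idea of representing high-precision tensors in a low-bit fixed-point format can be sketched in a few lines. This is an illustrative example only, assuming symmetric uniform quantization with a single per-tensor scale; the abstract does not say which quantization scheme PreQuant uses, and the function names and sample weights are hypothetical.

```python
def quantize_symmetric(values, num_bits=4):
    """Map floats to signed integer codes in [-qmax, qmax] using one shared scale."""
    qmax = 2 ** (num_bits - 1) - 1
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate floats from the integer codes."""
    return [c * scale for c in codes]


# Hypothetical weight values, standing in for one flattened tensor.
weights = [0.12, -0.07, 0.031, -0.2, 0.0, 0.093]

codes, scale = quantize_symmetric(weights, num_bits=4)
restored = dequantize(codes, scale)

# Rounding to the nearest grid point bounds the per-value error by half a step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
```

Storing the 4-bit codes plus one scale per tensor, instead of 32-bit floats, is where the size reduction comes from; the cost is the rounding error measured by `max_error`.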
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Speech Recognition and Synthesis
Methods: Gated Linear Unit · Multi-Head Attention · Attention Is All You Need · Byte Pair Encoding · Attention Dropout · Linear Warmup With Linear Decay · Residual Connection · SentencePiece · Linear Layer · Adafactor
