PreQuant: A Task-agnostic Quantization Approach for Pre-trained Language Models

Zhuocheng Gong; Jiahao Liu; Qifan Wang; Yang Yang; Jingang Wang; Wei Wu; Yunsen Xian; Dongyan Zhao; Rui Yan

arXiv:2306.00014·cs.CL·June 2, 2023·1 cites

TL;DR

PreQuant is a task-agnostic quantization framework for pre-trained language models. It compresses the model before fine-tuning, so the quantization step itself needs no task-specific training, and it maintains performance across multiple NLP benchmarks.

Contribution

It proposes a novel 'quantize before fine-tuning' approach that is compatible with various quantization strategies and incorporates parameter-efficient fine-tuning to mitigate quantization errors.
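
The ordering described above (quantize first, then fine-tune only a few parameters) can be sketched on a toy linear model. This is an illustration under stated assumptions, not the paper's procedure: the 3-bit symmetric quantizer, the choice of a single bias as the only trainable parameter, the synthetic task, and every name in the block are hypothetical.

```python
# Toy illustration only: a 4-weight linear model, not the PreQuant algorithm.

def quantize(values, num_bits):
    """Snap values to a signed num_bits grid and return the dequantized floats.

    Assumes at least one value is non-zero so the scale is well defined.
    """
    qmax = 2 ** (num_bits - 1) - 1
    scale = max(abs(v) for v in values) / qmax
    return [round(v / scale) * scale for v in values]

def predict(weights, bias, row):
    return sum(w * x for w, x in zip(weights, row)) + bias

def mse(weights, bias, inputs, targets):
    errors = [predict(weights, bias, row) - t for row, t in zip(inputs, targets)]
    return sum(e * e for e in errors) / len(errors)

# Hypothetical "pre-trained" weights and a synthetic downstream task whose
# targets are the full-precision outputs shifted by 0.5.
pretrained = [0.42, -1.30, 0.07, 0.88]
inputs = [[1, 0, 2, 1], [0, 1, 1, 2], [2, 1, 0, 1], [1, 2, 1, 0]]
targets = [predict(pretrained, 0.5, row) for row in inputs]

# Step 1: quantize once, before any task data is used.
frozen = quantize(pretrained, num_bits=3)

# Step 2: fine-tune only the bias; the quantized weights stay frozen.
bias = 0.0
initial_loss = mse(frozen, bias, inputs, targets)
for _ in range(100):
    residuals = [predict(frozen, bias, row) - t for row, t in zip(inputs, targets)]
    grad = 2 * sum(residuals) / len(residuals)  # d(MSE)/d(bias)
    bias -= 0.1 * grad
final_loss = mse(frozen, bias, inputs, targets)
```

In this sketch the quantized weights are shared by every task and only the small trainable part changes per task, which is the property the "task-agnostic" label refers to.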

Findings

1. Effective on the GLUE benchmark with BERT, RoBERTa, and T5
2. Outperforms task-specific quantization methods
3. Reduces model size with minimal performance loss

Abstract

While transformer-based pre-trained language models (PLMs) have dominated a number of NLP applications, these models are heavy to deploy and expensive to use. Therefore, effectively compressing large-scale PLMs has become an increasingly important problem. Quantization, which represents high-precision tensors in a low-bit fixed-point format, is a viable solution. However, most existing quantization methods are task-specific, requiring customized training and quantization with a large number of trainable parameters for each individual task. Inspired by the observation that the over-parameterized nature of PLMs makes it possible to freeze most of the parameters during the fine-tuning stage, in this work, we propose a novel "quantize before fine-tuning" framework, PreQuant, that differs from both quantization-aware training and post-training quantization. PreQuant is compatible with various…
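
As a rough illustration of what representing tensors in a low-bit fixed-point format costs, the sketch below applies plain symmetric uniform quantization to a handful of values at several bit-widths and records the worst-case rounding error. The quantizer, the sample values, and the function names are assumptions chosen for illustration; the abstract does not specify which quantization scheme PreQuant uses.

```python
# Minimal sketch of symmetric uniform quantization (an assumed scheme,
# not necessarily the one used in the paper).

def quantize_symmetric(values, num_bits):
    """Return (integer codes, scale) using one shared scale for all values."""
    qmax = 2 ** (num_bits - 1) - 1            # e.g. 7 for 4 bits
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return codes, scale

def dequantize(codes, scale):
    """Map integer codes back to approximate floating-point values."""
    return [c * scale for c in codes]

values = [0.42, -1.30, 0.07, 0.88, -0.55]     # hypothetical weights

max_errors = {}
scales = {}
for bits in (2, 4, 8):
    codes, scale = quantize_symmetric(values, bits)
    restored = dequantize(codes, scale)
    max_errors[bits] = max(abs(v - r) for v, r in zip(values, restored))
    scales[bits] = scale
```

With round-to-nearest and no clipping, the error per value is bounded by half the scale, so each extra bit roughly halves the worst-case error while increasing storage per value.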

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Speech Recognition and Synthesis

Methods: Gated Linear Unit · Multi-Head Attention · Attention Is All You Need · Byte Pair Encoding · Attention Dropout · Linear Warmup With Linear Decay · Residual Connection · SentencePiece · Linear Layer · Adafactor