Towards Efficient Post-training Quantization of Pre-trained Language Models

Haoli Bai; Lu Hou; Lifeng Shang; Xin Jiang; Irwin King; Michael R. Lyu

arXiv:2109.15082·cs.CL·October 1, 2021·21 cites

TL;DR

This paper introduces an efficient post-training quantization method for large pre-trained language models. It minimizes the quantization error module by module, which reduces training time, memory use, and data requirements while keeping performance close to that of quantization-aware training.

Contribution

The paper proposes module-wise quantization error minimization (MREM) and a model parallel training strategy for PLMs, reducing training time and resource use compared with quantization-aware training methods.

Findings

01 Achieves near-QAT performance with less training overhead.

02 Enables parallel training of model modules across multiple devices.

03 Significantly reduces training time, memory, and data consumption.

Abstract

Network quantization has gained increasing attention with the rapid growth of large pre-trained language models (PLMs). However, most existing quantization methods for PLMs follow quantization-aware training (QAT), which requires end-to-end training with full access to the entire dataset. They therefore suffer from slow training, large memory overhead, and data security issues. In this paper, we study post-training quantization (PTQ) of PLMs and propose module-wise quantization error minimization (MREM), an efficient solution to mitigate these issues. By partitioning the PLM into multiple modules, we minimize the reconstruction error incurred by quantization for each module. In addition, we design a new model parallel training strategy such that each module can be trained locally on separate computing devices without waiting for preceding modules, which brings nearly the theoretical…
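The module-wise idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the tiny ReLU "modules", the symmetric uniform quantizer, the grid search over a clipping value, and all function names are assumptions made for illustration. It also assumes that each module receives the full-precision output of the preceding module as its input, which is one way to let modules be calibrated independently; the abstract does not say how the paper removes that dependency.

```python
import random


def quantize(w, num_bits, clip):
    """Symmetric uniform quantization of one weight into [-clip, clip]."""
    levels = 2 ** (num_bits - 1) - 1          # e.g. 7 for 4 bits
    step = clip / levels
    q = max(-levels, min(levels, round(w / step)))
    return q * step


def run_module(x, layers):
    """Apply a stack of linear layers (lists of rows), each followed by ReLU."""
    for weight in layers:
        x = [max(0.0, sum(wi * xi for wi, xi in zip(row, x))) for row in weight]
    return x


def quantize_module(layers, num_bits, clip):
    """Quantize every weight of every layer in one module."""
    return [[[quantize(w, num_bits, clip) for w in row] for row in weight]
            for weight in layers]


def reconstruction_error(inputs, fp_layers, q_layers):
    """Mean squared error between full-precision and quantized module outputs."""
    total, count = 0.0, 0
    for x in inputs:
        fp_out = run_module(x, fp_layers)
        q_out = run_module(x, q_layers)
        total += sum((a - b) ** 2 for a, b in zip(fp_out, q_out))
        count += len(fp_out)
    return total / count


def calibrate_module(inputs, fp_layers, num_bits, clip_candidates):
    """Pick the clipping value with the lowest module reconstruction error."""
    best_clip, best_err = None, None
    for clip in clip_candidates:
        q_layers = quantize_module(fp_layers, num_bits, clip)
        err = reconstruction_error(inputs, fp_layers, q_layers)
        if best_err is None or err < best_err:
            best_clip, best_err = clip, err
    return best_clip, best_err


random.seed(0)
DIM = 4
CLIPS = [0.25, 0.5, 0.75, 1.0]


def rand_layer():
    return [[random.uniform(-1, 1) for _ in range(DIM)] for _ in range(DIM)]


# Three hypothetical modules of two layers each, plus a small calibration set.
modules = [[rand_layer(), rand_layer()] for _ in range(3)]
calib = [[random.uniform(-1, 1) for _ in range(DIM)] for _ in range(8)]

results = []
module_inputs = calib
for fp_layers in modules:
    # Each module is calibrated using only full-precision inputs, so no
    # iteration depends on the quantized result of an earlier module.
    results.append(calibrate_module(module_inputs, fp_layers, 4, CLIPS))
    module_inputs = [run_module(x, fp_layers) for x in module_inputs]

for i, (clip, err) in enumerate(results):
    print(f"module {i}: clip={clip}, reconstruction MSE={err:.6f}")
```

Because every module's input here comes from the full-precision model, the loop iterations are independent and could in principle run on separate devices. A real method would optimize quantization parameters by gradient descent rather than by grid search.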


Videos

Towards Efficient Post-training Quantization of Pre-trained Language Models · SlidesLive

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Advanced Neural Network Applications