GPTQT: Quantize Large Language Models Twice to Push the Efficiency

Yipin Guo; Yilin Lang; Qinyuan Ren

arXiv:2407.02891 · cs.LG · July 4, 2024

TL;DR

GPTQT is a two-step post-training quantization method that reduces memory usage and speeds up inference for large language models by representing weights as 3-bit or 2-bit binary codes. In the reported tests it outperforms a strong 3-bit quantization baseline.

Contribution

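The paper introduces GPTQT, a progressive two-step quantization approach for large language models: weights are first quantized linearly to a relatively high bit width, then converted to a lower-bit binary coding, and the two steps are merged into pure binary coding at inference time.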

Findings

01 Reduces perplexity by 4.01 on OPT-66B compared to a strong 3-bit quantization baseline

02 Increases inference speed by 1.24 times on OPT-30B compared to the same baseline

03 Outperforms existing binary coding quantization methods on Llama2

Abstract

Due to their large size, generative Large Language Models (LLMs) require significant computing and storage resources. This paper introduces a new post-training quantization method, GPTQT, which reduces memory usage and increases processing speed by expressing LLM weights in 3 or 2 bits. Practice has shown that directly minimizing the quantization error of the weights is ineffective and leads to overfitting. GPTQT therefore employs a progressive two-step approach: it first quantizes the weights to a relatively high bit width using linear quantization, then converts the resulting integer weights to a lower-bit binary coding. A re-explore strategy is proposed to optimize the initial scaling factor. During inference, the two steps are merged into pure binary coding, enabling efficient computation. Testing across various models and datasets confirms GPTQT's effectiveness. Compared to the strong 3-bit quantization baseline,…
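The two-step idea in the abstract can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: the function names (`linear_quantize`, `greedy_binary_coding`, `two_step_quantize`, `dequantize`) are hypothetical, the second step uses a generic greedy binary-coding fit as a stand-in, and the re-explore strategy for the scaling factor is not reproduced. The sketch only shows how a linear quantization followed by a binary coding can be folded into a single binary-coding representation with merged scales and one bias.

```python
import numpy as np

def linear_quantize(w, bits):
    """Step 1 (assumed form): asymmetric uniform quantization to integers in [0, 2**bits - 1]."""
    qmax = 2 ** bits - 1
    w_min, w_max = float(w.min()), float(w.max())
    scale = (w_max - w_min) / qmax if w_max > w_min else 1.0
    q = np.clip(np.round((w - w_min) / scale), 0, qmax)
    return q, scale, w_min

def greedy_binary_coding(x, k):
    """Step 2 (generic stand-in): x ~= sum_i alphas[i] * codes[i], with codes in {-1, +1}."""
    residual = x.astype(np.float64).copy()
    alphas, codes = [], []
    for _ in range(k):
        b = np.where(residual >= 0, 1.0, -1.0)   # sign of the current residual
        a = float(np.mean(np.abs(residual)))     # least-squares scale for that sign vector
        alphas.append(a)
        codes.append(b)
        residual = residual - a * b
    return np.array(alphas), np.stack(codes)

def two_step_quantize(w, high_bits=4, low_bits=3):
    """Quantize linearly to high_bits, then fit a low_bits-term binary coding to the integers."""
    q, scale, zero = linear_quantize(w, high_bits)
    offset = float(q.mean())                     # center the integers before binary coding
    alphas, codes = greedy_binary_coding(q - offset, low_bits)
    # Merge both steps: w ~= scale * (alphas @ codes + offset) + zero
    #                     = (scale * alphas) @ codes + (scale * offset + zero)
    return scale * alphas, codes, scale * offset + zero

def dequantize(alphas, codes, bias):
    """Reconstruct weights from the merged binary-coding form."""
    return alphas @ codes + bias

rng = np.random.default_rng(0)
w = rng.normal(size=256)
alphas, codes, bias = two_step_quantize(w)
w_hat = dequantize(alphas, codes, bias)
print("mean squared error:", np.mean((w - w_hat) ** 2))
```

After merging, only the sign codes, `low_bits` scales, and one bias are needed to reconstruct a weight row, which is the sense in which the two steps collapse into pure binary coding in this sketch.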

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Biomedical Text Mining and Ontologies

Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings