TEQ: Trainable Equivalent Transformation for Quantization of LLMs

Wenhua Cheng; Yiyang Cai; Kaokao Lv; Haihao Shen

arXiv:2310.10944 · cs.CL · October 18, 2023 · 1 citation

TL;DR

TEQ introduces a lightweight, trainable equivalent transformation that enables low-bit (3- and 4-bit) weight-only quantization of large language models while preserving the FP32 precision of the model output and adding no inference overhead, with results on par with state-of-the-art methods.

Contribution

The paper proposes TEQ, a novel trainable equivalent transformation that preserves FP32 output precision during low-bit quantization of LLMs, requiring minimal training and no additional inference cost.

Findings

01

Achieves results on par with state-of-the-art quantization methods on typical LLMs.

02

Requires only 1K training steps, with trainable parameters amounting to fewer than 0.1% of the original model's parameters.

03

Compatible with other methods for enhanced performance.

Abstract

As large language models (LLMs) become more prevalent, there is a growing need for new and improved quantization methods that can meet the computational demands of these modern architectures while maintaining accuracy. In this paper, we present TEQ, a trainable equivalent transformation that preserves the FP32 precision of the model output while taking advantage of low-precision quantization, especially 3- and 4-bit weight-only quantization. The training process is lightweight, requiring only 1K steps and fewer than 0.1 percent of the original model's trainable parameters. Furthermore, the transformation does not add any computational overhead during inference. Our results are on par with the state-of-the-art (SOTA) methods on typical LLMs. Our approach can be combined with other methods to achieve even better performance. The code is available at…
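To make the idea of an "equivalent transformation" concrete, here is a minimal NumPy sketch, not the authors' implementation. It assumes the transformation is a per-input-channel scale vector `s` applied to a linear layer: the weights are multiplied by `s` and the activations divided by `s`, so the full-precision output is unchanged, and only the transformed weights are quantized. The layer sizes, the random `s` (which would be learned during training in practice), and the helper `quantize_weight_rtn` (plain round-to-nearest) are all illustrative assumptions.

```python
import numpy as np


def quantize_weight_rtn(w, bits=4):
    """Hypothetical helper: asymmetric round-to-nearest fake quantization,
    with one scale and zero point per output row."""
    qmax = 2 ** bits - 1
    w_min = w.min(axis=1, keepdims=True)
    w_max = w.max(axis=1, keepdims=True)
    scale = np.maximum(w_max - w_min, 1e-8) / qmax
    zero = np.round(-w_min / scale)
    q = np.clip(np.round(w / scale) + zero, 0, qmax)
    return (q - zero) * scale  # dequantized weights, same shape as w


rng = np.random.default_rng(0)
d_in, d_out = 64, 32                      # illustrative layer sizes
x = rng.normal(size=(8, d_in))            # a batch of activations
w = rng.normal(size=(d_out, d_in))        # linear layer: y = x @ w.T

# Assumed form of the transformation: one positive scale per input channel.
# Here it is random; in a trainable scheme it would be optimized.
s = rng.uniform(0.5, 2.0, size=d_in)

x_t = x / s                               # scale the activations down ...
w_t = w * s                               # ... and the weights up by the same factor

y_ref = x @ w.T
y_equiv = x_t @ w_t.T
equivalent = np.allclose(y_ref, y_equiv)  # the float output is unchanged

# Only the transformed weights are quantized (weight-only quantization).
w_q = quantize_weight_rtn(w_t, bits=4)
y_q = x_t @ w_q.T
max_abs_error = float(np.abs(y_q - y_ref).max())

# Extra trainable parameters relative to this layer's weights: d_in / (d_in * d_out).
trainable_fraction = d_in / (d_in * d_out)

print("equivalent in float:", equivalent)
print("max abs error after 4-bit quantization:", max_abs_error)
print("trainable fraction for this layer:", trainable_fraction)
```

In schemes of this kind, the division by `s` is typically folded into the preceding layer's parameters (for example, a normalization layer's weights) rather than computed at run time, which is one way a transformation can avoid adding inference cost. Under the per-channel assumption, the extra parameters for a layer are a fraction `1 / d_out` of its weights, so the fraction shrinks as layers get wider.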

Code & Models

Repositories

intel/neural-compressor (official)

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Speech Recognition and Synthesis