Norm Tweaking: High-performance Low-bit Quantization of Large Language Models

Liang Li; Qingyuan Li; Bo Zhang; Xiangxiang Chu

arXiv:2309.02784·cs.LG·December 14, 2023·2 cites

TL;DR

This paper introduces 'norm tweaking', a plugin technique for post-training quantization (PTQ) that improves the accuracy of low-bit quantized large language models, reaching 2-bit accuracy comparable to the float models on GLM-130B and OPT-66B.

Contribution

The paper proposes norm tweaking, a method that plugs into existing PTQ techniques and updates the weights of normalization layers, allowing accurate low-bit quantization of large language models with minimal performance degradation.

Findings

01 Achieves 2-bit quantization accuracy comparable to float models on GLM-130B and OPT-66B.

02 Significantly outperforms existing PTQ methods in weight-only and joint quantization.

03 Demonstrates practical applicability for real-world deployment of large language models.

Abstract

As the size of large language models (LLMs) continues to grow, model compression without sacrificing accuracy has become a crucial challenge for deployment. While some quantization methods, such as GPTQ, have made progress in achieving acceptable 4-bit weight-only quantization, attempts at lower-bit quantization often result in severe performance degradation. In this paper, we introduce a technique called norm tweaking, which can be used as a plugin in current PTQ methods to achieve high precision while being cost-efficient. Our approach is inspired by the observation that rectifying the quantized activation distribution to match its float counterpart can readily restore accuracy for LLMs. To achieve this, we carefully design a tweaking strategy that includes calibration data generation and channel-wise distance constraint to update the weights of normalization layers for better…
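To make the idea of a channel-wise distance constraint concrete, here is a minimal sketch in Python. The abstract excerpt does not give the loss formula, so the form below (per-channel gap in mean plus gap in variance, averaged over channels) is an assumption, and the names `channel_wise_distance` and `apply_norm_scale` are hypothetical. A hand-picked scale stands in for the update to the normalization-layer weights that the abstract describes; this is an illustration, not the authors' implementation.

```python
from statistics import fmean, pvariance


def channel_wise_distance(float_act, quant_act):
    """Average over channels of |mean gap| + |variance gap|.

    Assumed form of the distance; inputs are lists of rows (tokens),
    each row a list of per-channel activation values.
    """
    n_channels = len(float_act[0])
    total = 0.0
    for c in range(n_channels):
        f_col = [row[c] for row in float_act]
        q_col = [row[c] for row in quant_act]
        total += abs(fmean(f_col) - fmean(q_col))
        total += abs(pvariance(f_col) - pvariance(q_col))
    return total / n_channels


def apply_norm_scale(act, scale):
    """Multiply each channel by a per-channel scale, standing in for
    the learnable weight of a normalization layer (hypothetical helper)."""
    return [[v * s for v, s in zip(row, scale)] for row in act]


# Toy activations: 2 tokens x 2 channels. Quantization is imagined to have
# shrunk channel 0 by half and left channel 1 unchanged.
float_act = [[1.0, -2.0], [3.0, 2.0]]
quant_act = [[0.5, -2.0], [1.5, 2.0]]

before = channel_wise_distance(float_act, quant_act)
# Tweaking the norm scale of channel 0 realigns the two distributions.
after = channel_wise_distance(float_act, apply_norm_scale(quant_act, [2.0, 1.0]))
print(before, after)
```

In this toy case the distance drops from 0.875 to 0.0 once the scale on channel 0 is corrected. In practice the scale would presumably be found by optimizing such a distance over calibration data rather than chosen by hand.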

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Speech Recognition and Synthesis