CrossQuant: A Post-Training Quantization Method with Smaller Quantization Kernel for Precise Large Language Model Compression

Wenyuan Liu; Xindian Ma; Peng Zhang; Yan Wang

arXiv:2410.07505·cs.LG·October 11, 2024

TL;DR

CrossQuant is a post-training quantization method that shrinks the quantization kernel (the set of activation elements that are quantized to zero), which keeps accuracy loss minimal when compressing large language models.

Contribution

The paper introduces the concept of the quantization kernel and proposes CrossQuant, a method that yields smaller kernels and better preserves accuracy in LLM quantization.

Findings

01. The proportion of activation elements in the quantization kernel correlates with accuracy loss.

02. CrossQuant reduces the kernel proportion to approximately 16% for OPT models and below 0.1% for LLaMA models.

03. Experimental results show that model performance is maintained or improved.

Abstract

Post-Training Quantization (PTQ) is an effective technique for compressing Large Language Models (LLMs). While many studies focus on quantizing both weights and activations, maintaining the accuracy of LLMs after activation quantization remains a challenge. To investigate the primary cause, we extend the concept of kernel from linear algebra to quantization functions to define a new term, "quantization kernel", which refers to the set of elements in activations that are quantized to zero. Through quantitative analysis of the quantization kernel, we find that these elements are crucial for maintaining the accuracy of quantized LLMs. As the quantization kernel shrinks, the precision of quantized LLMs increases. If the quantization kernel proportion is kept below 19% for OPT models and below 1% for LLaMA models, the precision loss from quantizing activations to INT8 becomes…
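
To make the definition above concrete, here is a minimal sketch of how a quantization kernel proportion could be measured. It assumes symmetric per-token (per-row) absmax INT8 quantization as the baseline quantizer, which the abstract does not specify; the function names `quantize_int8_per_token` and `kernel_proportion` and the toy activation matrix are hypothetical. It is not the CrossQuant method itself.

```python
import numpy as np

def quantize_int8_per_token(x):
    """Symmetric absmax INT8 quantization with one scale per row (token).

    Assumption: this baseline quantizer is chosen for illustration only.
    """
    absmax = np.max(np.abs(x), axis=1, keepdims=True)
    # Avoid division by zero for all-zero rows.
    scale = np.where(absmax > 0, absmax / 127.0, 1.0)
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def kernel_proportion(q):
    """Fraction of elements quantized to zero.

    Follows the abstract's wording literally, so inputs that were already
    exactly zero are counted as well (an assumption).
    """
    return float(np.mean(q == 0))

# Toy activations: the outlier 100.0 in the first row inflates that row's
# scale, so the small entries next to it all round to zero.
x = np.array([[0.1, 0.2, 100.0, -0.3],
              [1.0, -2.0, 3.0, 4.0]])
q, scale = quantize_int8_per_token(x)
prop = kernel_proportion(q)
print(q[0])   # first row after quantization
print(prop)   # 0.375 (3 of 8 elements land in the kernel)
```

In this toy case a single outlier pushes three of the four entries in its row into the kernel, which illustrates why activations with large outliers can produce a large kernel under a shared per-row scale.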

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Speech Recognition and Synthesis

Methods: Sparse Evolutionary Training · LLaMA · OPT · Focus