CrossQuant: A Post-Training Quantization Method with Smaller Quantization Kernel for Precise Large Language Model Compression
Wenyuan Liu, Xindian Ma, Peng Zhang, Yan Wang

TL;DR
CrossQuant is a post-training quantization method that shrinks the quantization kernel (the set of activation elements that are quantized to zero), which keeps accuracy loss small when compressing large language models.
Contribution
The paper introduces the concept of the quantization kernel and proposes CrossQuant, a method that yields smaller kernels and better preserves accuracy when quantizing LLMs.
Findings
A smaller quantization kernel correlates with lower accuracy loss.
CrossQuant reduces the quantization kernel proportion to about 16% for OPT models and below 0.1% for LLaMA models.
Experimental results show improved or maintained model performance.
Abstract
Post-Training Quantization (PTQ) is an effective technique for compressing Large Language Models (LLMs). While many studies focus on quantizing both weights and activations, it is still a challenge to maintain the accuracy of LLMs after quantizing activations. To investigate the primary cause, we extend the concept of kernel from linear algebra to quantization functions to define a new term, "quantization kernel", which refers to the set of elements in activations that are quantized to zero. Through quantitative analysis of the quantization kernel, we find that these elements are crucial for maintaining the accuracy of quantized LLMs. As the quantization kernel shrinks, the precision of quantized LLMs increases. If the quantization kernel proportion is kept below 19% for OPT models and below 1% for LLaMA models, the precision loss from quantizing activations to INT8 becomes…
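To make the "quantization kernel" definition concrete, here is a minimal Python sketch that measures the kernel proportion of an activation matrix. It assumes a plain symmetric per-token (row-wise) absmax INT8 quantizer as the baseline; that quantizer, the function names `quantize_per_token_int8` and `kernel_proportion`, and the toy matrix are illustrative assumptions and are not taken from the paper. It does not implement CrossQuant itself.

```python
import numpy as np


def quantize_per_token_int8(x):
    """Symmetric per-token absmax quantization (an assumed baseline, not CrossQuant)."""
    # One scale per row: the largest magnitude in each row maps to 127.
    scale = np.abs(x).max(axis=1, keepdims=True) / 127.0
    # Guard against all-zero rows, which would otherwise divide by zero.
    scale = np.where(scale == 0, 1.0, scale)
    q = np.clip(np.rint(x / scale), -127, 127).astype(np.int8)
    return q, scale


def kernel_proportion(q):
    """Fraction of elements whose quantized code is zero (the quantization kernel)."""
    return float((q == 0).mean())


# Toy activations: a large entry in a row inflates that row's scale,
# so small entries in the same row round to zero.
x = np.array([
    [127.0, 0.4, 0.6, -1.0],
    [1.0, 0.5, -0.25, 0.0],
])
q, scale = quantize_per_token_int8(x)
print(kernel_proportion(q))  # 0.25: the entries 0.4 and 0.0 are quantized to zero
```

In the first row the scale is 1.0, so any entry with magnitude below 0.5 falls into the kernel. This illustrates why a single large activation can push many small activations in the same row to zero under this assumed baseline.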
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Speech Recognition and Synthesis
Methods: Sparse Evolutionary Training · LLaMA · OPT · Focus
