LCQ: Low-Rank Codebook based Quantization for Large Language Models
Wen-Pu Cai, Ming-Yang Li, Wu-Jun Li

TL;DR
This paper introduces LCQ, a low-rank codebook quantization method for large language models that improves accuracy over traditional rank-one codebook methods while maintaining low storage costs.
Contribution
The paper proposes a novel low-rank codebook quantization technique that enhances model accuracy without significantly increasing storage requirements.
Findings
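In the reported experiments, LCQ achieves better accuracy than existing weight quantization methods.
The accuracy gain comes with only a negligible increase in storage cost.
Rank-one codebook methods lose substantial accuracy at high compression ratios, which is the setting LCQ targets.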
Abstract
Large language models (LLMs) have recently demonstrated promising performance in many tasks. However, the high storage and computational cost of LLMs has become a challenge for their deployment. Weight quantization, which can reduce both storage and computational cost, has been widely used for model compression. Most existing weight quantization methods for LLMs use a rank-one codebook for quantization, which results in substantial accuracy loss when the compression ratio is high. In this paper, we propose a novel weight quantization method for LLMs, called low-rank codebook based quantization (LCQ). LCQ adopts a low-rank codebook, the rank of which can be larger than one, for quantization. Experiments show that LCQ can achieve better accuracy than existing methods with a negligible extra storage cost.
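The abstract does not spell out how the codebooks are built, so the following is only a minimal sketch of one way to read the rank-one versus low-rank distinction. It assumes a group-wise setting in which each weight group owns one row of a codebook matrix: a uniform grid scaled per group gives an outer product (rank one), while a product of two thin factors gives a codebook of rank up to r. All function names, shapes, and the choice of factors are hypothetical illustrations, not the authors' algorithm.

```python
import numpy as np

def rank_one_codebook(scales, bits):
    """Assumed uniform codebook: row g is scales[g] times a fixed, centered integer grid."""
    levels = np.arange(2 ** bits, dtype=np.float64) - (2 ** bits - 1) / 2.0
    return np.outer(scales, levels)  # shape (groups, 2**bits), rank 1

def low_rank_codebook(left, right):
    """Assumed low-rank codebook: left is (groups, r), right is (r, 2**bits)."""
    return left @ right  # rank is at most r

def quantize(weights, codebook):
    """Map every weight to the nearest codeword in its own group's codebook row."""
    # weights: (groups, group_size); codebook: (groups, num_codewords)
    dist = np.abs(weights[:, :, None] - codebook[:, None, :])
    idx = dist.argmin(axis=2)                         # stored integer codes
    deq = np.take_along_axis(codebook, idx, axis=1)   # dequantized weights
    return idx, deq

# Toy data: 4 groups of 16 weights, 3-bit codes (8 codewords per group).
rng = np.random.default_rng(0)
bits = 3
w = rng.standard_normal((4, 16))
levels = np.arange(2 ** bits, dtype=np.float64) - (2 ** bits - 1) / 2.0
scales = np.abs(w).max(axis=1) / levels.max()

c1 = rank_one_codebook(scales, bits)

# Hypothetical rank-2 factors: the uniform grid plus a shared nonuniform term.
left = np.column_stack([scales, np.full_like(scales, 0.05)])
right = np.vstack([levels, levels ** 3])
c2 = low_rank_codebook(left, right)

idx, deq = quantize(w, c2)
print(np.linalg.matrix_rank(c1), np.linalg.matrix_rank(c2))
```

In this sketch the rank-2 factors are fixed by hand purely to show the shapes involved; a real method would have to choose or learn them, and nothing here implies that this particular rank-2 codebook gives lower error than the rank-one one.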
Taxonomy
Topics: Speech Recognition and Synthesis
