LRQ: Optimizing Post-Training Quantization for Large Language Models by Learning Low-Rank Weight-Scaling Matrices
Jung Hyun Lee, Jeonghoon Kim, June Yong Yang, Se Jung Kwon, Eunho Yang, Kang Min Yoo, Dongsoo Lee

TL;DR
This paper introduces LRQ (Low-Rank Quantization), a post-training weight quantization method for large language models. It learns low-rank weight-scaling matrices in place of full ones, which cuts the number of learnable parameters and improves the generalization of the quantized model, especially on massive multitask language understanding.
Contribution
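LRQ replaces the full weight-scaling matrices used in post-training quantization, which carry one learnable scale per weight, with low-rank weight-scaling matrices. This significantly reduces the number of learnable parameters and improves accuracy over prior methods.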
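To make the parameter saving concrete, the sketch below counts learnable scales for a single linear layer under two assumptions that are not stated on this page: the low-rank scaling matrix is a product of a (c_out x rank) factor and a (rank x c_in) factor, and the layer is 4096 x 4096 with rank 1024. The function names, dimensions, and rank are hypothetical and chosen only for illustration.

```python
def full_scale_params(c_out: int, c_in: int) -> int:
    """One learnable scale per weight: as many scales as weights."""
    return c_out * c_in


def low_rank_scale_params(c_out: int, c_in: int, rank: int) -> int:
    """Two factors of shape (c_out, rank) and (rank, c_in)."""
    return rank * (c_out + c_in)


# Hypothetical layer size and rank, for illustration only.
c_out, c_in, rank = 4096, 4096, 1024

full = full_scale_params(c_out, c_in)
low = low_rank_scale_params(c_out, c_in, rank)
ratio = low / full

print(f"full scaling matrix: {full} learnable scales")
print(f"low-rank factors:    {low} learnable scales")
print(f"ratio:               {ratio:.2f}")
```

The saving grows as the rank shrinks relative to the layer width; any bias-like terms a real parameterization might add are ignored here.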
Findings
LRQ outperforms previous PTQ methods on large language models.
LRQ maintains high accuracy with 4-bit weight and 8-bit activation quantization.
LRQ demonstrates robustness across various quantization schemes.
Abstract
With the commercialization of large language models (LLMs), weight-activation quantization has emerged to compress and accelerate LLMs, achieving high throughput while reducing inference costs. However, existing post-training quantization (PTQ) techniques for quantizing weights and activations of LLMs still suffer from non-negligible accuracy drops, especially on massive multitask language understanding. To address this issue, we propose Low-Rank Quantization (LRQ) - a simple yet effective post-training weight quantization method for LLMs that reconstructs the outputs of an intermediate Transformer block by leveraging low-rank weight-scaling matrices, replacing the conventional full weight-scaling matrices that entail as many learnable scales as their associated weights. Thanks to parameter sharing via low-rank structure, LRQ only needs to learn significantly fewer parameters while…
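As a rough illustration of what a low-rank weight-scaling matrix could look like in practice, here is a minimal NumPy sketch. It is not the authors' implementation: it assumes the scaling matrix is parameterized as exp(L @ U) and divides the weights elementwise before rounding, uses a symmetric per-output-channel step size, and omits the block-output reconstruction that would actually learn L and U. All names (`fake_quantize_low_rank`, `L`, `U`, `rank`) are hypothetical.

```python
import numpy as np


def fake_quantize_low_rank(W, L, U, n_bits=4):
    """Quantize-dequantize W with an elementwise low-rank scaling matrix.

    Assumed form (not taken from the source): W_hat = step * round(W / (step * exp(L @ U))).
    """
    qmax = 2 ** (n_bits - 1) - 1
    qmin = -(2 ** (n_bits - 1))
    # Symmetric per-output-channel step size (one value per row of W).
    step = np.abs(W).max(axis=1, keepdims=True) / qmax
    # Low-rank scaling matrix; exp keeps every scale strictly positive.
    S = np.exp(L @ U)
    codes = np.clip(np.round(W / (step * S)), qmin, qmax)
    return step * codes


rng = np.random.default_rng(0)
c_out, c_in, rank = 8, 16, 2  # toy sizes, illustration only

W = rng.normal(size=(c_out, c_in))
# Zero-initializing L makes exp(L @ U) all ones, so the starting point
# is plain round-to-nearest quantization.
L = np.zeros((c_out, rank))
U = rng.normal(scale=0.01, size=(rank, c_in))

W_hat = fake_quantize_low_rank(W, L, U, n_bits=4)
print("learnable scales:", L.size + U.size, "vs weights:", W.size)
print("mean abs quantization error:", np.abs(W - W_hat).mean())
```

In a full pipeline, `L` and `U` would be optimized on calibration data to minimize the difference between the outputs of the original and quantized Transformer block, as the abstract describes; that optimization loop is left out of this sketch.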
Taxonomy
Topics: Speech Recognition and Synthesis · Topic Modeling
Methods: Attention Is All You Need · Residual Connection · Byte Pair Encoding · Layer Normalization · Label Smoothing · Linear Layer · Adam · Dropout · Multi-Head Attention · Dense Connections
