LRQ: Optimizing Post-Training Quantization for Large Language Models by Learning Low-Rank Weight-Scaling Matrices

Jung Hyun Lee; Jeonghoon Kim; June Yong Yang; Se Jung Kwon; Eunho Yang; Kang Min Yoo; Dongsoo Lee

arXiv:2407.11534 · cs.LG · February 11, 2025

TL;DR

This paper introduces LRQ (Low-Rank Quantization), a post-training weight quantization method for large language models. By replacing full weight-scaling matrices with low-rank ones, LRQ learns far fewer scaling parameters, which improves generalization and post-training quantization accuracy, especially on massive multitask language understanding.

Contribution

LRQ replaces the full weight-scaling matrices used in post-training quantization with low-rank weight-scaling matrices, significantly reducing the number of learnable parameters and improving accuracy over prior PTQ methods.

Findings

01 LRQ outperforms previous PTQ methods on large language models.

02 LRQ maintains high accuracy with 4-bit weight and 8-bit activation quantization.

03 LRQ demonstrates robustness across various quantization schemes.

Abstract

With the commercialization of large language models (LLMs), weight-activation quantization has emerged to compress and accelerate LLMs, achieving high throughput while reducing inference costs. However, existing post-training quantization (PTQ) techniques for quantizing weights and activations of LLMs still suffer from non-negligible accuracy drops, especially on massive multitask language understanding. To address this issue, we propose Low-Rank Quantization (LRQ) - a simple yet effective post-training weight quantization method for LLMs that reconstructs the outputs of an intermediate Transformer block by leveraging low-rank weight-scaling matrices, replacing the conventional full weight-scaling matrices that entail as many learnable scales as their associated weights. Thanks to parameter sharing via low-rank structure, LRQ only needs to learn significantly fewer parameters while…
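The parameter saving described in the abstract can be illustrated with a minimal sketch. The abstract text here is truncated and does not give the exact LRQ formulation, so everything below is an assumption made for illustration: the scale matrix is taken to be a plain product of two factors `L` and `U`, the quantizer divides each weight by a shared step size times its learned scale before rounding, and the layer size and rank are hypothetical. All function names are invented for this example and are not the authors' code.

```python
def full_scale_params(out_features, in_features):
    # A full weight-scaling matrix holds one learnable scale per weight.
    return out_features * in_features


def low_rank_scale_params(out_features, in_features, rank):
    # A rank-r factorization stores L (out x r) and U (r x in) instead.
    return rank * (out_features + in_features)


def low_rank_scales(L, U):
    # Expand the factors into an (out x in) scale matrix S = L @ U.
    rank, cols = len(U), len(U[0])
    return [
        [sum(row[k] * U[k][j] for k in range(rank)) for j in range(cols)]
        for row in L
    ]


def quantize_with_scales(W, S, step, bits=4):
    # Assumed quantizer: divide each weight by step * scale, round to the
    # nearest integer, clamp to the signed integer grid, then rescale by step.
    qmin, qmax = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    return [
        [step * max(qmin, min(qmax, round(w / (step * s)))) for w, s in zip(w_row, s_row)]
        for w_row, s_row in zip(W, S)
    ]


# Hypothetical 4096 x 4096 linear layer with an assumed rank of 64.
print(full_scale_params(4096, 4096))          # 16777216
print(low_rank_scale_params(4096, 4096, 64))  # 524288

# Tiny 1 x 2 example: scales built from rank-1 factors change the rounding.
W = [[0.3, 0.8]]
S = low_rank_scales([[1.0]], [[1.0, 2.0]])
W_hat = quantize_with_scales(W, S, step=0.5)
```

With these hypothetical sizes, the low-rank factors hold 32 times fewer learnable scales than the full matrix, which is the kind of parameter sharing the abstract refers to. In the tiny example, the second weight rounds to a different grid point than it would with all scales equal to one, showing how learned scales steer the rounding.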

Taxonomy

Topics: Speech Recognition and Synthesis · Topic Modeling

Methods: Attention Is All You Need · Residual Connection · Byte Pair Encoding · Layer Normalization · Label Smoothing · Linear Layer · Adam · Dropout · Multi-Head Attention · Dense Connections