ResQ: Mixed-Precision Quantization of Large Language Models with Low-Rank Residuals

Utkarsh Saxena; Sayeh Sharify; Kaushik Roy; Xin Wang

arXiv:2412.14363·cs.LG·February 5, 2025

TL;DR

ResQ is a mixed-precision post-training quantization method for large language models that uses low-rank subspace analysis (PCA) and invariant random rotation to reduce quantization error and improve inference efficiency.

Contribution

The paper proposes a mixed-precision quantization scheme based on PCA and invariant random rotation, which the authors show to be provably optimal and which outperforms existing methods on large language models.

Findings

01  Achieves up to 33% lower perplexity on Wikitext.

02  Provides up to 3× speedup over the 16-bit baseline.

03  Outperforms recent uniform and mixed-precision PTQ methods.

Abstract

Post-training quantization (PTQ) of large language models (LLMs) holds promise for reducing the prohibitive computational cost of inference. Quantizing all weight, activation, and key-value (KV) cache tensors to 4-bit without significantly degrading generalizability is challenging because extreme outliers in activations cause high quantization error. To tackle this problem, we propose ResQ, a PTQ method that pushes the state of the art further. By means of principal component analysis (PCA), it identifies a low-rank subspace (in practice 1/8 of the hidden dimension) in which activation variances are highest, and keeps the coefficients within this subspace in high precision, e.g. 8-bit, while quantizing the rest to 4-bit. Within each subspace, invariant random rotation is applied to further suppress outliers. We show that this is a provably optimal mixed precision…
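
The scheme described in the abstract can be illustrated with a small NumPy sketch. It is not the authors' implementation: the function names, the symmetric per-row quantizer, the uncentered covariance, the synthetic outlier data, and the use of quantize-then-dequantize ("fake" quantization) on activations alone are all assumptions made for illustration. It shows only the core idea: project activations onto a PCA basis, keep the top 1/8 of the components in 8-bit, quantize the rest to 4-bit, and apply a random orthogonal rotation within each subspace.

```python
import numpy as np


def fake_quantize(x, bits):
    """Symmetric per-row uniform quantization, returned in dequantized form.
    (Assumed quantizer; the abstract does not specify one.)"""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(x).max(axis=-1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)  # avoid division by zero
    return np.clip(np.round(x / scale), -qmax - 1, qmax) * scale


def random_orthogonal(n, rng):
    """Random orthogonal matrix from the QR decomposition of a Gaussian matrix."""
    q, r = np.linalg.qr(rng.standard_normal((n, n)))
    return q * np.sign(np.diag(r))  # sign fix; columns stay orthonormal


def pca_basis(calib):
    """Eigenvectors of the (uncentered) covariance, sorted by decreasing variance."""
    cov = calib.T @ calib / calib.shape[0]
    _, eigvecs = np.linalg.eigh(cov)  # eigh returns ascending eigenvalues
    return eigvecs[:, ::-1]


def mixed_precision_quantize(x, basis, rank, rng, hi_bits=8, lo_bits=4):
    """Quantize the top-`rank` PCA coefficients at hi_bits and the rest at lo_bits."""
    d = basis.shape[0]
    # Rotate within each subspace; the rotated bases span the same subspaces.
    p_hi = basis[:, :rank] @ random_orthogonal(rank, rng)
    p_lo = basis[:, rank:] @ random_orthogonal(d - rank, rng)
    c_hi = fake_quantize(x @ p_hi, hi_bits)
    c_lo = fake_quantize(x @ p_lo, lo_bits)
    # Map back to the original space to measure reconstruction error.
    return c_hi @ p_hi.T + c_lo @ p_lo.T


def demo(seed=0, d=64, n=512):
    """Compare mixed-precision and uniform 4-bit error on synthetic outlier data."""
    rng = np.random.default_rng(seed)
    channel_scale = np.ones(d)
    channel_scale[:4] = 50.0  # hypothetical outlier channels
    calib = rng.standard_normal((n, d)) * channel_scale
    x = rng.standard_normal((n, d)) * channel_scale
    basis = pca_basis(calib)
    x_mixed = mixed_precision_quantize(x, basis, rank=d // 8, rng=rng)
    x_uniform = fake_quantize(x, bits=4)
    return np.mean((x - x_mixed) ** 2), np.mean((x - x_uniform) ** 2)


if __name__ == "__main__":
    err_mixed, err_uniform = demo()
    print(f"mixed 8/4-bit MSE: {err_mixed:.4f}, uniform 4-bit MSE: {err_uniform:.4f}")
```

On this synthetic data the outlier channels fall inside the high-precision subspace, so the 4-bit quantizer only sees low-variance coefficients. In a real model the projections would typically be folded into adjacent weight matrices rather than applied and inverted explicitly as done here.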

Code & Models

Repositories

utkarsh-dmx/project-resq (PyTorch, official)

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Speech Recognition and Synthesis

Methods: LLaMA