Low-Rank Correction for Quantized LLMs

Meyer Scetbon; James Hensman

arXiv:2412.07902·stat.ML·December 12, 2024

Open Access · 3 Reviews

TL;DR

This paper introduces a low-rank correction method for quantized large language models, significantly improving accuracy by adding full-precision low-rank matrices to correct quantization errors in activations.

Contribution

It proposes a novel joint optimization approach for quantizing weights and activations using low-rank matrices, enhancing post-training model compression for LLMs.

Findings

01

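Reduces the accuracy gap with the original model by more than 50% when the low-rank matrices use ranks equivalent to 10% of the original weight matrix size.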

02

Achieves complete accuracy recovery at 30% rank.

03

Demonstrates effectiveness on Llama-2, Llama-3, Phi-3, and Mixtral models.

Abstract

We consider the problem of model compression for Large Language Models (LLMs) at post-training time, where the task is to compress a well-trained model using only a small set of calibration input data. In this work, we introduce a new low-rank approach to correct for quantization errors of activations in LLMs: we propose to add low-rank weight matrices in full precision that act on the unquantized activations. We then solve a joint optimization problem over the quantized representation of the weights and additional low-rank weight matrices to quantize both weights and activations. We focus on the case of 4-bit weight-and-activation quantization (W4A4). Using ranks equivalent to 10% of the original weight matrix size, our approach reduces the accuracy gap with the original model by more than 50%. Using ranks equivalent to 30% of the original weight matrix, the accuracy…
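The structure described in the abstract, a quantized weight applied to quantized activations plus a full-precision low-rank term applied to the unquantized activations, can be illustrated with a small NumPy sketch. This is not the authors' algorithm: the paper jointly optimizes the quantized weights and the low-rank matrices (a reviewer notes GPTQ is used as the weight quantizer), whereas the sketch below keeps a round-to-nearest quantized weight fixed and fits only the low-rank term by reduced-rank least squares on calibration data. The function names, the per-row and per-token scaling, and the choice of rank as 10% of the smaller matrix dimension are all assumptions made for illustration.

```python
import numpy as np

def quantize(x, bits=4, axis=-1):
    """Simulated symmetric round-to-nearest quantization (returns floats)."""
    qmax = 2 ** (bits - 1) - 1                      # 7 for 4-bit
    scale = np.abs(x).max(axis=axis, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)        # avoid division by zero
    return np.clip(np.round(x / scale), -qmax - 1, qmax) * scale

def fit_correction(W, W_q, X, X_q, rank):
    """Fit U, V_t minimizing ||W X - W_q X_q - U V_t X||_F with W_q fixed.

    Assumption: reduced-rank regression stands in for the paper's joint
    optimization, which also updates the quantized weights.
    """
    R = W @ X - W_q @ X_q                           # residual to be corrected
    X_pinv = np.linalg.pinv(X)                      # shape (n, d_in)
    # Project the residual onto the row space of X, then keep the top directions.
    U, _, _ = np.linalg.svd(R @ X_pinv @ X, full_matrices=False)
    U_r = U[:, :rank]                               # shape (d_out, rank)
    V_t = U_r.T @ R @ X_pinv                        # shape (rank, d_in)
    return U_r, V_t

def corrected_forward(W_q, U, V_t, X, bits=4):
    """Quantized path on quantized activations plus low-rank path on raw activations."""
    X_q = quantize(X, bits, axis=0)                 # per-token (column) scales
    return W_q @ X_q + U @ (V_t @ X)

# Toy calibration setup (hypothetical sizes, random data).
rng = np.random.default_rng(0)
d_out, d_in, n = 64, 48, 512
W = rng.standard_normal((d_out, d_in))
X = rng.standard_normal((d_in, n))                  # columns are calibration tokens

W_q = quantize(W, axis=1)                           # per-output-row scales
X_q = quantize(X, axis=0)
rank = max(1, int(0.1 * min(d_out, d_in)))          # assumed reading of "10% rank"
U, V_t = fit_correction(W, W_q, X, X_q, rank)

Y = W @ X
err_plain = np.linalg.norm(Y - W_q @ X_q)
err_corrected = np.linalg.norm(Y - corrected_forward(W_q, U, V_t, X))
print(f"W4A4 error: {err_plain:.3f}, with rank-{rank} correction: {err_corrected:.3f}")
```

Because a zero correction is always a feasible choice, the fitted low-rank term cannot increase the calibration error in this sketch; how much it helps on real models depends on the structure of the quantization error, which random data does not capture.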

Peer Reviews

Decision·Submitted to ICLR 2025

Reviewer 01 · Rating 6 · Confidence 3

Strengths

- The paper is well written and easy to understand. The scheme they propose is sound and intuitive in its formulation.
- Their method can use any weight quantization technique as a subroutine (they use GPTQ in the paper), which allows other tools/papers to plug in their own method.
- They perform sensible ablations in order to clearly identify the impact of weight-only quantization vs. activation quantization, and when low-rank error correction offers value. Moreover, they also show that the

Weaknesses

# Major
- *Limited Contribution*: The paper stitches together many well-known building blocks from the PTQ literature to build a sane, effective technique. In my opinion, it is a sound engineering feat, but it still has high overlap with the previous work on the topic by Zhang et al. (2024) and Ou et al. (2024). The authors do differentiate themselves by performing a joint optimization over the low-rank and quantized matrices, which is key to the delta over the previous work. However, this is

Reviewer 02 · Rating 6 · Confidence 2

Strengths

* The idea of introducing low-rank adaptation to correct the quantization error is good, and trivially effective. In my view, these adaptation-based methods are worthy of further and comprehensive study.
* The derivation in this paper provides concise intuition, which is easy to follow.
* The topic of efficient LLM deployment is becoming vital, and this method has considerable potential in addressing such PTQ problems on LLMs.

Weaknesses

* It is not clear how the rank of the adaptation influences efficiency. This is my main concern for this paper. I strongly recommend that the authors add an experiment evaluating its effect on memory usage and speed.
* The presentation of this paper is good, but not excellent. The authors should add an introduction to the GPTQ method and Cholesky in their appendix (since they are parts of the main algorithm) to make the paper accessible to a broader audience.
* The choice of dataset in

Reviewer 03 · Rating 5 · Confidence 2

Strengths

The strength of this paper is its novel approach to quantizing large language models to 4-bit weights and activations while maintaining high accuracy. The LRC method's ability to optimize jointly for a quantized weight matrix and a full-precision low-rank correction matrix, which is connected to the original unquantized activations, effectively reduces quantization error. This innovative technique sets LRC apart from previous approaches and demonstrates its potential for enabling highly compressed models with m

Weaknesses

The paper does not analyze the computational cost associated with the added low-rank correction matrix. While the method effectively reduces quantization error, the impact on inference time and memory usage is not thoroughly explored. This is an important consideration for the practical deployment of the LRC method. The authors leave the ideal implementation of the low-rank computation for future work. Without a concrete implementation strategy, it may be difficult for practitioners to immediat

Taxonomy

Topics: Magnetic confinement fusion research · Particle accelerators and beam dynamics · Particle Accelerators and Free-Electron Lasers

Methods: Sparse Evolutionary Training · Focus