Low-Rank Correction for Quantized LLMs
Meyer Scetbon, James Hensman

TL;DR
This paper introduces a low-rank correction method for quantized large language models, significantly improving accuracy by adding full-precision low-rank matrices to correct quantization errors in activations.
Contribution
It proposes a novel joint optimization approach for quantizing weights and activations using low-rank matrices, enhancing post-training model compression for LLMs.
Findings
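Reduces the accuracy gap with the original model by more than 50% under W4A4 quantization, using ranks equivalent to 10% of the original weight matrix size.
Achieves complete accuracy recovery using ranks equivalent to 30% of the original weight matrix.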
Demonstrates effectiveness on Llama-2, Llama-3, Phi-3, and Mixtral models.
Abstract
We consider the problem of model compression for Large Language Models (LLMs) at post-training time, where the task is to compress a well-trained model using only a small set of calibration input data. In this work, we introduce a new low-rank approach to correct for quantization errors of activations in LLMs: we propose to add low-rank weight matrices in full precision that act on the unquantized activations. We then solve a joint optimization problem over the quantized representation of the weights and additional low-rank weight matrices to quantize both weights and activations. We focus on the case of 4-bit weight-and-activation quantization (W4A4). Using ranks equivalent to 10% of the original weight matrix size, our approach reduces the accuracy gap with the original model by more than 50%. Using ranks equivalent to 30% of the original weight matrix, the accuracy…
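The layer structure described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' algorithm: it assumes a simple symmetric round-to-nearest quantizer in place of a method such as GPTQ, and it fits the low-rank pair with a single closed-form reduced-rank least-squares step instead of the paper's joint optimization. The names `quantize_rtn`, `fit_low_rank_correction`, and `forward` are hypothetical.

```python
import numpy as np

def quantize_rtn(t, bits=4):
    """Symmetric per-tensor round-to-nearest quantizer (assumed stand-in, not GPTQ)."""
    qmax = 2 ** (bits - 1) - 1                   # 7 for 4 bits
    scale = np.abs(t).max() / qmax
    return np.clip(np.round(t / scale), -qmax - 1, qmax) * scale

def fit_low_rank_correction(W, W_q, X, rank):
    """Fit full-precision U, V so that W_q @ Q(X) + U @ V.T @ X approximates W @ X."""
    R = W @ X - W_q @ quantize_rtn(X)            # residual left by W4A4, shape (d_out, n)
    B = R @ np.linalg.pinv(X)                    # unconstrained least-squares map, (d_out, d_in)
    U_full, _, _ = np.linalg.svd(B @ X, full_matrices=False)
    U = U_full[:, :rank]                         # top-k output directions of the fitted residual
    V = (U.T @ B).T                              # shape (d_in, rank)
    return U, V

def forward(W_q, U, V, x):
    """Quantized path on quantized activations plus low-rank path on unquantized ones."""
    return W_q @ quantize_rtn(x) + U @ (V.T @ x)

# Toy calibration problem (random data, illustrative only).
rng = np.random.default_rng(0)
d_out, d_in, n = 16, 32, 256
W = rng.standard_normal((d_out, d_in))
X = rng.standard_normal((d_in, n))

W_q = quantize_rtn(W)
U, V = fit_low_rank_correction(W, W_q, X, rank=4)

err_plain = np.linalg.norm(W @ X - W_q @ quantize_rtn(X))
err_lrc = np.linalg.norm(W @ X - forward(W_q, U, V, X))
print(f"W4A4 only: {err_plain:.3f}   with rank-4 correction: {err_lrc:.3f}")
```

On the calibration data, the corrected error cannot exceed the uncorrected one, because the fitted term is an orthogonal projection of the residual. A joint scheme in the spirit of the abstract would additionally re-quantize the weights given the low-rank term and repeat.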
Peer Reviews
Decision: Submitted to ICLR 2025
- The paper is well written and easy to understand. The scheme it proposes is sound and intuitive in its formulation.
- The method can use any weight quantization technique as a subroutine (the authors use GPTQ in the paper), which allows other tools and papers to plug in their own method.
- The authors perform sensible ablations to clearly identify the impact of weight-only quantization versus activation quantization, and when low-rank error correction offers value. Moreover, they also show that the
# Major
- *Limited Contribution*: The paper stitches together many well-known building blocks in the PTQ literature to build a sane, effective technique. In my opinion, it is a sound engineering feat, but it still has high overlap with previous work on the topic by Zhang et al. (2024) and Ou et al. (2024). The authors do differentiate themselves by performing a joint optimization over the low-rank and quantized matrices, which is key to the delta over previous work. However, this is
* The idea of introducing low-rank adaptation to correct the quantization error is good, and trivially effective. In my view, these adaptation-based methods are worthy of further and more comprehensive study.
* The derivation in this paper provides concise intuition and is easy to follow.
* The topic of efficient LLM deployment is becoming vital, and this method has considerable potential for addressing such PTQ problems on LLMs.

* It is not clear how the rank of the adaptation influences efficiency. This is my main concern with the paper. I strongly recommend that the authors add an experiment evaluating its effect on memory usage and speed.
* The presentation of this paper is good but could be improved. The authors should add an introduction to the GPTQ method and the Cholesky decomposition in the appendix (since they are part of the main algorithm) to make the paper accessible to a broader audience.
* The choice of dataset in
The strength of this paper is its novel approach to quantizing large language models to 4-bit weights and activations while maintaining high accuracy. The LRC method jointly optimizes a quantized weight matrix and a full-precision low-rank correction matrix that acts on the original unquantized activations, which effectively reduces quantization error. This innovative technique sets LRC apart from previous approaches and demonstrates its potential for enabling highly compressed models with m
The paper does not analyze the computational cost associated with the added low-rank correction matrix. While the method effectively reduces quantization error, the impact on inference time and memory usage is not thoroughly explored. This is an important consideration for the practical deployment of the LRC method. The authors leave the ideal implementation of the low-rank computation for future work. Without a concrete implementation strategy, it may be difficult for practitioners to immediat
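The cost question raised in these reviews can be made concrete with back-of-the-envelope arithmetic. The sketch below is not from the paper. It assumes a single dense layer of shape `d_out × d_in`, a rank-`k` correction stored as two factors, 4-bit storage for the quantized weights, and 16-bit storage for the low-rank factors. The helper `lrc_overhead` and the reading of "10% rank" as `k = 0.1 · min(d_out, d_in)` are assumptions for illustration.

```python
def lrc_overhead(d_out, d_in, rank, weight_bits=4, lowrank_bits=16):
    """Rough per-layer cost of a rank-k correction (hypothetical helper)."""
    base_params = d_out * d_in                   # entries of the quantized weight matrix
    extra_params = rank * (d_out + d_in)         # entries of the two low-rank factors
    return {
        "extra_params": extra_params,
        # One multiply-accumulate per parameter per token, so the ratio of
        # parameter counts is also the ratio of extra MACs (ignoring bit width).
        "extra_macs_ratio": extra_params / base_params,
        "base_bytes": base_params * weight_bits / 8,
        "extra_bytes": extra_params * lowrank_bits / 8,
    }

# Example: a 4096 x 4096 layer with rank assumed to be 10% of the smaller dimension.
stats = lrc_overhead(4096, 4096, rank=int(0.10 * 4096))
for name, value in stats.items():
    print(name, value)
```

Under these assumptions the low-rank factors add roughly 20% more multiply-accumulates per token, and their 16-bit storage is about 80% of the size of the 4-bit weight matrix. Measured latency would depend on kernel implementation, which the reviews note is left for future work.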
Taxonomy
Topics: Magnetic confinement fusion research · Particle accelerators and beam dynamics · Particle Accelerators and Free-Electron Lasers
Methods: Sparse Evolutionary Training · Focus
