LQ-LoRA: Low-rank Plus Quantized Matrix Decomposition for Efficient Language Model Finetuning
Han Guo, Philip Greengard, Eric P. Xing, Yoon Kim

TL;DR
LQ-LoRA introduces a memory-efficient method for fine-tuning large language models by decomposing weight matrices into low-rank and quantized parts, enabling aggressive quantization with minimal performance loss.
Contribution
The paper presents a novel low-rank plus quantized matrix decomposition technique for efficient language model fine-tuning, outperforming existing quantization baselines and enabling sub-3-bit quantization.
Findings
Outperforms QLoRA and GPTQ-LoRA baselines.
Enables quantization to below 3 bits with minor performance loss.
Also works as a model compression method: a 2.75-bit LLaMA-2-70B model performs respectably compared to the 16-bit baseline.
Abstract
We propose a simple approach for memory-efficient adaptation of pretrained language models. Our approach uses an iterative algorithm to decompose each pretrained matrix into a high-precision low-rank component and a memory-efficient quantized component. During finetuning, the quantized component remains fixed and only the low-rank component is updated. We present an integer linear programming formulation of the quantization component which enables dynamic configuration of quantization parameters (e.g., bit-width, block size) for each matrix given an overall target memory budget. We further explore a data-aware version of the algorithm which uses an approximation of the Fisher information matrix to weight the reconstruction objective during matrix decomposition. Experiments on finetuning RoBERTa and LLaMA-2 (7B and 70B) demonstrate that our low-rank plus quantized matrix decomposition…
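The alternating decomposition described in the abstract can be illustrated with a small NumPy sketch. This is a minimal illustration under stated assumptions, not the authors' implementation. It swaps in a plain per-matrix uniform round-to-nearest quantizer for whatever quantizer the paper uses. The names `uniform_quantize` and `lq_decompose`, the iteration count, and the keep-the-best-iterate rule are hypothetical choices made for this example. The ILP-based configuration search and the Fisher-weighted variant are not shown.

```python
import numpy as np


def uniform_quantize(x, bits):
    """Round-to-nearest uniform quantizer over the range of x.

    Assumption: a simple stand-in quantizer chosen for illustration only.
    """
    levels = 2 ** bits - 1
    lo, hi = x.min(), x.max()
    if hi == lo:
        return x.copy()
    scale = (hi - lo) / levels
    return np.round((x - lo) / scale) * scale + lo


def lq_decompose(W, rank, bits, n_iters=10):
    """Approximate W as Q + L1 @ L2, with Q quantized and L1 @ L2 of rank <= rank.

    Alternates between (1) a truncated SVD of the residual W - Q and
    (2) quantizing the residual W - L1 @ L2.
    """
    Q = np.zeros_like(W)
    best = None
    for _ in range(n_iters):
        # Low-rank step: best rank-r approximation of what Q does not explain.
        U, S, Vt = np.linalg.svd(W - Q, full_matrices=False)
        root_s = np.sqrt(S[:rank])
        L1 = U[:, :rank] * root_s
        L2 = root_s[:, None] * Vt[:rank]
        # Quantization step: quantize what the low-rank part does not explain.
        Q = uniform_quantize(W - L1 @ L2, bits)
        err = np.linalg.norm(W - Q - L1 @ L2)
        # This sketch does not assume the error decreases at every step,
        # so it keeps the best iterate seen so far.
        if best is None or err < best[0]:
            best = (err, Q, L1, L2)
    err, Q, L1, L2 = best
    return Q, L1, L2, err


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    W = rng.standard_normal((64, 64))
    Q, L1, L2, err = lq_decompose(W, rank=4, bits=3)
    print("relative error:", err / np.linalg.norm(W))
```

In a finetuning setup matching the abstract's description, `Q` would stay frozen while only `L1` and `L2` receive gradient updates.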
Taxonomy
Topics: Topic Modeling · Speech Recognition and Synthesis · Multimodal Machine Learning Applications
Methods: Multi-Head Attention · Attention Is All You Need · Residual Connection · Layer Normalization · Linear Warmup With Linear Decay · Dense Connections · Dropout · Softmax · Linear Layer · WordPiece
