LQ-LoRA: Low-rank Plus Quantized Matrix Decomposition for Efficient Language Model Finetuning

Han Guo; Philip Greengard; Eric P. Xing; Yoon Kim

arXiv:2311.12023 · cs.CL · August 28, 2024 · 2 cites

TL;DR

LQ-LoRA introduces a memory-efficient method for fine-tuning large language models by decomposing weight matrices into low-rank and quantized parts, enabling aggressive quantization with minimal performance loss.

Contribution

The paper presents a novel low-rank plus quantized matrix decomposition technique for efficient language model fine-tuning, outperforming existing quantization baselines and enabling sub-3-bit quantization.

Findings

01 Outperforms QLoRA and GPTQ-LoRA baselines.

02 Enables quantization to below 3 bits with minor performance loss.

03 Achieves effective model compression with 2.75-bit LLaMA-2-70B.
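
To put the bit-widths in the findings into perspective, the short Python sketch below estimates weight storage at several average bit-widths. It assumes a nominal 70 billion parameters (inferred from the model name, not stated on this page) and counts only the quantized weights; the low-rank components, quantization metadata such as per-block scales, and activation memory are ignored. The helper name `weight_memory_gb` is hypothetical.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Approximate storage for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


# Assumption: nominal parameter count taken from the "70B" in the model name.
NUM_PARAMS = 70e9

for bits in (16, 4, 3, 2.75):
    gb = weight_memory_gb(NUM_PARAMS, bits)
    print(f"{bits:>5} bits/param -> about {gb:.1f} GB")
```

Under these assumptions, the step from 16 bits to 2.75 bits per parameter shrinks the weight footprint by a factor of 16 / 2.75, or roughly 5.8.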

Abstract

We propose a simple approach for memory-efficient adaptation of pretrained language models. Our approach uses an iterative algorithm to decompose each pretrained matrix into a high-precision low-rank component and a memory-efficient quantized component. During finetuning, the quantized component remains fixed and only the low-rank component is updated. We present an integer linear programming formulation of the quantization component which enables dynamic configuration of quantization parameters (e.g., bit-width, block size) for each matrix given an overall target memory budget. We further explore a data-aware version of the algorithm which uses an approximation of the Fisher information matrix to weight the reconstruction objective during matrix decomposition. Experiments on finetuning RoBERTa and LLaMA-2 (7B and 70B) demonstrate that our low-rank plus quantized matrix decomposition…
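
The abstract describes an iterative decomposition of each pretrained matrix W into a quantized component Q and a low-rank component L1 L2. The NumPy sketch below shows one plausible reading of that idea, not the authors' implementation. It assumes an alternating scheme (a truncated SVD for the low-rank step, then quantization of the residual) and substitutes simple blockwise min-max uniform quantization for whatever quantizer the paper uses. The names `quantize_uniform` and `lq_decompose`, and all default parameters, are hypothetical.

```python
import numpy as np


def quantize_uniform(x, bits=3, block_size=64):
    """Blockwise min-max uniform quantization followed by dequantization.

    Assumption: a simple stand-in quantizer, not the one used in the paper.
    x.size must be divisible by block_size.
    """
    blocks = x.reshape(-1, block_size)
    lo = blocks.min(axis=1, keepdims=True)
    hi = blocks.max(axis=1, keepdims=True)
    levels = 2 ** bits - 1
    # Guard against constant blocks, where hi == lo.
    scale = np.where(hi > lo, (hi - lo) / levels, 1.0)
    codes = np.round((blocks - lo) / scale)
    return (codes * scale + lo).reshape(x.shape)


def lq_decompose(W, rank=8, bits=3, block_size=64, num_iters=10):
    """Alternate between a rank-`rank` fit and quantizing the residual.

    Returns (Q, L1, L2, err) for the iterate with the lowest Frobenius error,
    since this kind of alternation is a heuristic and need not be monotone.
    """
    Q = np.zeros_like(W)
    best_err, best = np.inf, None
    for _ in range(num_iters):
        # Low-rank step: best rank-r approximation of W - Q via truncated SVD.
        U, S, Vt = np.linalg.svd(W - Q, full_matrices=False)
        root = np.sqrt(S[:rank])
        L1 = U[:, :rank] * root
        L2 = root[:, None] * Vt[:rank]
        # Quantization step: quantize what the low-rank part does not explain.
        Q = quantize_uniform(W - L1 @ L2, bits, block_size)
        err = np.linalg.norm(W - (Q + L1 @ L2))
        if err < best_err:
            best_err, best = err, (Q, L1, L2)
    return best[0], best[1], best[2], best_err


# Toy matrix: approximately rank 4 plus small noise (illustrative data only).
rng = np.random.default_rng(0)
W = rng.standard_normal((64, 4)) @ rng.standard_normal((4, 64))
W += 0.01 * rng.standard_normal((64, 64))

Q, L1, L2, err = lq_decompose(W)
baseline = np.linalg.norm(W - quantize_uniform(W))
print(f"quantization only: {baseline:.4f}  low-rank plus quantized: {err:.4f}")
```

In a finetuning setup matching the abstract, Q would stay fixed and only L1 and L2 would receive gradient updates.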

Code & Models

Repositories

hanguo97/lq-lora
jax · Official

Taxonomy

Topics: Topic Modeling · Speech Recognition and Synthesis · Multimodal Machine Learning Applications

Methods: Multi-Head Attention · Attention Is All You Need · Residual Connection · Layer Normalization · Linear Warmup With Linear Decay · Dense Connections · Dropout · Softmax · Linear Layer · WordPiece