ModuLoRA: Finetuning 2-Bit LLMs on Consumer GPUs by Integrating with Modular Quantizers

Junjie Yin; Jiahao Dong; Yingheng Wang; Christopher De Sa; Volodymyr Kuleshov

arXiv:2309.16119·cs.LG·March 12, 2024

TL;DR

ModuLoRA introduces a memory-efficient finetuning method for large language models that enables 2-bit and 3-bit precision training on consumer GPUs by integrating modular quantization with low-rank adapters.

Contribution

The paper presents a quantization-agnostic backward pass that enables finetuning of LLMs at 2-bit and 3-bit precision, outperforming finetuning built on less sophisticated 4-bit and 8-bit quantization methods.
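To make the idea of a quantization-agnostic backward pass concrete, here is a minimal, dependency-free sketch. It is not the authors' implementation: `ToyQuantizer`, `lora_forward`, and `lora_backward` are hypothetical names, and the uniform quantizer merely stands in for a real black-box module such as QuIP# or OPTQ. The point it illustrates is that the frozen weight is re-materialized from the quantizer whenever it is needed, and gradients are computed only for the input and the low-rank adapters `A` and `B`.

```python
# Hypothetical sketch; none of these names come from the ModuLoRA code base.

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


class ToyQuantizer:
    """Stand-in black-box quantizer: uniform codes with one offset and step size."""

    def __init__(self, weight, bits=2):
        levels = 2 ** bits - 1
        flat = [v for row in weight for v in row]
        self.lo = min(flat)
        self.step = (max(flat) - self.lo) / levels or 1.0
        self.codes = [[round((v - self.lo) / self.step) for v in row] for row in weight]

    def dequantize(self):
        # Materialize a full-precision copy on demand; nothing is cached.
        return [[self.lo + c * self.step for c in row] for row in self.codes]


def lora_forward(x, quantizer, A, B):
    """y = x W^T + (x A^T) B^T, with W rebuilt from the quantizer."""
    w_hat = quantizer.dequantize()  # temporary, dropped when the call returns
    base = matmul(x, transpose(w_hat))
    xa = matmul(x, transpose(A))
    return add(base, matmul(xa, transpose(B)))


def lora_backward(x, grad_y, quantizer, A, B):
    """Gradients for x, A and B only; the quantized weight stays frozen."""
    w_hat = quantizer.dequantize()  # re-materialized rather than saved from forward
    xa = matmul(x, transpose(A))
    grad_B = matmul(transpose(grad_y), xa)   # shape (d_out, r)
    grad_xa = matmul(grad_y, B)              # shape (n, r)
    grad_A = matmul(transpose(grad_xa), x)   # shape (r, d_in)
    grad_x = add(matmul(grad_y, w_hat), matmul(grad_xa, A))  # shape (n, d_in)
    return grad_x, grad_A, grad_B
```

Because the two functions only ever call `quantizer.dequantize()`, any object exposing that method could be swapped in without touching the adapter logic, which is the sense of "quantization-agnostic" assumed here.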

Findings

01 Enables finetuning 2-bit and 3-bit LLMs for the first time

02 Achieves competitive performance on NLP tasks with less memory

03 Surpasses state-of-the-art ROUGE scores on summarization

Abstract

We propose a memory-efficient finetuning algorithm for large language models (LLMs) that supports finetuning LLMs with 65B parameters in 2/3/4-bit precision on as little as one 24GB GPU. Our method, modular low-rank adaptation (ModuLoRA), integrates any user-specified weight quantizer with finetuning via low-rank adapters (LoRAs). Our approach relies on a simple quantization-agnostic backward pass that adaptively materializes low-precision LLM weights from a custom black-box quantization module. This approach enables finetuning 2-bit and 3-bit LLMs for the first time -- leveraging state-of-the-art 2-bit QuIP# quantization and 3-bit OPTQ quantization -- outperforming finetuning that relies on less sophisticated 4-bit and 8-bit methods. In our experiments, ModuLoRA attains competitive performance on text classification, natural language inference, and instruction following tasks using…
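As a rough sanity check on the memory figures in the abstract, the storage needed for the quantized weights alone can be estimated with simple arithmetic. The helper below is a hypothetical back-of-the-envelope sketch: it counts only the packed weight bits and ignores quantizer metadata, adapters, activations, and optimizer state, all of which add to the real footprint.

```python
def weight_memory_gb(num_params, bits_per_weight):
    """Storage for the packed weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


# 65B parameters at several precisions; 16-bit is included for comparison.
for bits in (2, 3, 4, 16):
    print(f"{bits:>2}-bit: {weight_memory_gb(65e9, bits):.2f} GB")
```

Under this count, 65B parameters at 2 bits occupy 65e9 × 2 / 8 bytes = 16.25 GB, which leaves some headroom on a 24GB card, whereas the same model at 16 bits would need 130 GB for the weights alone.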

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications
