Memory-Efficient Fine-Tuning of Compressed Large Language Models via sub-4-bit Integer Quantization

Jeonghoon Kim; Jung Hyun Lee; Sungdong Kim; Joonsuk Park; Kang Min Yoo; Se Jung Kwon; Dongsoo Lee

arXiv:2305.14152 · cs.LG · October 31, 2023 · 28 cites

TL;DR

This paper introduces PEQA, a method combining parameter-efficient fine-tuning with sub-4-bit quantization, enabling memory-efficient adaptation of large language models while maintaining or improving their performance.

Contribution

PEQA allows fine-tuning of quantized LLMs by updating only quantization scales, reducing memory and model size without sacrificing performance.
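
As a rough illustration of what "updating only quantization scales" could look like, the sketch below quantizes a small weight matrix to 4-bit integers, freezes the integer values and zero-points, and takes one gradient step on the per-row scales alone. The quantization scheme (asymmetric round-to-nearest with one scale per output row), the squared-error loss, and all function names are assumptions made for this example; they are not taken from the paper or from any released code.

```python
import numpy as np

def quantize_per_channel(w, bits=4):
    """Asymmetric round-to-nearest quantization, one scale and zero-point per row.
    (Assumed scheme for illustration only.)"""
    qmax = 2 ** bits - 1
    w_min = w.min(axis=1, keepdims=True)
    w_max = w.max(axis=1, keepdims=True)
    scale = (w_max - w_min) / qmax
    scale = np.where(scale == 0, 1.0, scale)          # guard against constant rows
    zero = np.round(-w_min / scale).astype(np.int64)
    w_int = np.clip(np.round(w / scale) + zero, 0, qmax).astype(np.int64)
    return w_int, scale, zero

def dequantize(w_int, scale, zero):
    """Reconstruct approximate weights: w_hat = scale * (w_int - zero)."""
    return scale * (w_int - zero)

def mse_loss(w_int, scale, zero, x, y):
    """Mean squared error of the linear layer y_hat = x @ w_hat.T."""
    err = x @ dequantize(w_int, scale, zero).T - y
    return float(np.mean(err ** 2))

def scale_only_step(w_int, scale, zero, x, y, lr=1e-3):
    """One gradient-descent step that updates the scales and nothing else."""
    err = x @ dequantize(w_int, scale, zero).T - y    # shape (batch, out)
    grad_w_hat = 2.0 * err.T @ x / err.size           # dL/dw_hat, shape (out, in)
    # Chain rule through w_hat = scale * (w_int - zero); integers stay frozen.
    grad_scale = (grad_w_hat * (w_int - zero)).sum(axis=1, keepdims=True)
    return scale - lr * grad_scale

# Toy data: a 4x8 layer and a target produced by slightly different weights.
rng = np.random.default_rng(0)
w = rng.normal(size=(4, 8))
x = rng.normal(size=(16, 8))
y = x @ (1.1 * w).T

w_int, scale, zero = quantize_per_channel(w, bits=4)
w_int_before = w_int.copy()
lr = 1e-3
new_scale = scale_only_step(w_int, scale, zero, x, y, lr=lr)

trainable = scale.size    # one scale per output row
frozen = w_int.size       # integer weights are never updated
print(f"trainable scales: {trainable}, frozen integer weights: {frozen}")
```

In this toy setup the trainable state is one number per output row, while the integer matrix stays fixed, which is the intuition behind the reduced fine-tuning memory claimed above. Under this assumed formulation, switching tasks would only require swapping the scale vector.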

Findings

01

PEQA effectively fine-tunes 65B parameter models with sub-4-bit quantization.

02

Quantized LLMs with PEQA maintain or improve language understanding and reasoning.

03

Memory overhead during fine-tuning is significantly reduced.

Abstract

Large language models (LLMs) face challenges in fine-tuning and deployment due to their high memory demands and computational costs. While parameter-efficient fine-tuning (PEFT) methods aim to reduce the memory usage of the optimizer state during fine-tuning, the inherent size of pre-trained LLM weights continues to be a pressing concern. Even though quantization techniques are widely proposed to ease memory demands and accelerate LLM inference, most of these techniques are geared towards the deployment phase. To bridge this gap, this paper presents Parameter-Efficient and Quantization-aware Adaptation (PEQA), a simple yet effective method that combines the advantages of PEFT with quantized LLMs. By updating solely the quantization scales, PEQA can be directly applied to quantized LLMs, ensuring seamless task transitions. Parallel to existing PEFT methods, PEQA significantly…

Videos

Memory-Efficient Fine-Tuning of Compressed Large Language Models via sub-4-bit Integer Quantization · SlidesLive

Taxonomy

Topics: Topic Modeling · Ferroelectric and Negative Capacitance Devices · Speech Recognition and Synthesis