ReALLM: A general framework for LLM compression and fine-tuning

Louis Leconte; Lisa Bedin; Van Minh Nguyen; Eric Moulines

arXiv:2405.13155 · cs.LG · May 24, 2024 · 1 citation

Open Access

TL;DR

ReALLM compresses and fine-tunes large language models on a budget of fewer than 4 bits per parameter. It decomposes each pre-trained matrix into a high-precision low-rank component and a vector-quantized latent representation, and updates only the low-rank component during fine-tuning.

Contribution

ReALLM unifies post-training quantization and fine-tuning of language models at fewer than 4 bits per parameter. It adapts the representation of each matrix (embedding size, VQ bit-width) to that matrix and reconstructs the matrix with a neural decoder.

Findings

01. Achieves state-of-the-art results at 2-bit quantization after fine-tuning.

02. Outperforms existing methods on language generation benchmarks.

03. Decompressing a matrix requires only one embedding and a single forward pass through the decoder.

Abstract

We introduce ReALLM, a novel approach for compression and memory-efficient adaptation of pre-trained language models that encompasses most of the post-training quantization and fine-tuning methods for a budget of <4 bits. Pre-trained matrices are decomposed into a high-precision low-rank component and a vector-quantized latent representation (using an autoencoder). During the fine-tuning step, only the low-rank components are updated. Our results show that pre-trained matrices exhibit different patterns. ReALLM adapts the shape of the encoder (small/large embedding, high/low bit VQ, etc.) to each matrix. ReALLM proposes to represent each matrix with a small embedding on $b$ bits and a neural decoder model $D_{\phi}$ with its weights on $b_{\phi}$ bits. The decompression of a matrix requires only one embedding and a single forward pass with the decoder. Our weight-only quantization…
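
The decomposition described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the neural autoencoder is replaced here by a linear (PCA) encoder and decoder, the codebook is fitted with a few Lloyd iterations, and a plain reconstruction loss stands in for the fine-tuning objective. All sizes and names (`decompress`, `assign`, `decoder_weight`, and so on) are hypothetical and chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only to keep the example small.
out_dim, in_dim, rank = 16, 32, 4
block, latent_dim, bits = 8, 2, 2

W = rng.standard_normal((out_dim, in_dim))  # stand-in for a pre-trained matrix

# 1) High-precision low-rank component, taken from a truncated SVD.
U, s, Vt = np.linalg.svd(W, full_matrices=False)
A = U[:, :rank] * np.sqrt(s[:rank])          # (out_dim, rank)
B = np.sqrt(s[:rank])[:, None] * Vt[:rank]   # (rank, in_dim)

# 2) Encode the residual block by block. Assumption: a linear PCA
#    encoder/decoder replaces the neural autoencoder of the paper.
residual = (W - A @ B).reshape(-1, block)    # (n_blocks, block)
_, _, basis = np.linalg.svd(residual, full_matrices=False)
decoder_weight = basis[:latent_dim]          # toy decoder, (latent_dim, block)
latent = residual @ decoder_weight.T         # (n_blocks, latent_dim)

def assign(latent, codebook):
    """Index of the nearest codeword for every latent vector."""
    dist = ((latent[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return dist.argmin(axis=1)

# 3) Vector-quantize the latents with a codebook of 2**bits entries.
codebook = latent[rng.choice(len(latent), 2**bits, replace=False)].copy()
for _ in range(10):                          # a few Lloyd (k-means) steps
    indices = assign(latent, codebook)
    for k in range(len(codebook)):
        if np.any(indices == k):
            codebook[k] = latent[indices == k].mean(axis=0)
indices = assign(latent, codebook)           # frozen after compression

def decompress(indices, codebook, decoder_weight, A, B):
    """One codebook lookup and one decoder pass, plus the low-rank term."""
    decoded = codebook[indices] @ decoder_weight          # (n_blocks, block)
    return decoded.reshape(A.shape[0], B.shape[1]) + A @ B

W_hat = decompress(indices, codebook, decoder_weight, A, B)
err_before = np.linalg.norm(W - W_hat)

# 4) "Fine-tuning": one gradient step on A and B only. The quantized part
#    (indices, codebook, decoder_weight) is left untouched. The loss here is
#    ||W - W_hat||_F^2, a stand-in for a real downstream objective.
E = W - W_hat
lr = 1e-3
A_ft = A + lr * 2 * E @ B.T
B_ft = B + lr * 2 * A.T @ E
W_hat_ft = decompress(indices, codebook, decoder_weight, A_ft, B_ft)
err_after = np.linalg.norm(W - W_hat_ft)

print(f"reconstruction error before: {err_before:.4f}, after: {err_after:.4f}")
```

In this toy setting each block of `block` weights is stored as a single `bits`-bit index, while `A`, `B`, the codebook, and the decoder weights are kept in higher precision and shared across the matrix. How the actual bit budget is accounted for is specified in the paper, not here.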

Taxonomy

Topics: Natural Language Processing Techniques