Compressing Large Language Models using Low Rank and Low Precision Decomposition

Rajarshi Saha; Naomi Sagan; Varun Srivastava; Andrea J. Goldsmith; Mert Pilanci

arXiv:2405.18886 · cs.LG · November 5, 2024 · 1 citation



TL;DR

This paper presents CALDERA, a post-training compression method for large language models. It approximates each weight matrix as a quantized matrix plus a product of quantized low-rank factors, which reduces model size while aiming to preserve zero-shot performance.

Contribution

CALDERA introduces a new low-rank, low-precision decomposition technique for LLM compression, with theoretical error bounds and improved performance over existing methods.

Findings

01. Outperforms existing compression techniques at less than 2.5 bits per parameter.

02. Effectively compresses LLaMa-2 and LLaMa-3 models with minimal performance loss.

03. Provides theoretical bounds on approximation error and tradeoffs between compression ratio and accuracy.
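To make the "bits per parameter" figure concrete, here is a minimal sketch of how an average bit budget could be counted for a decomposition of the form Q + LR. The function name `avg_bits_per_param` and the example dimensions are hypothetical, and the count ignores any overhead for scales or codebooks, so it is a rough accounting rather than the paper's exact bookkeeping.

```python
def avg_bits_per_param(n, d, rank, bits_q, bits_l, bits_r):
    """Average storage cost per original weight for W (n x d) ~ Q + L @ R.

    Q is n x d at bits_q bits per entry, L is n x rank at bits_l bits,
    and R is rank x d at bits_r bits. Quantizer metadata (scales,
    codebooks) is deliberately ignored in this illustrative count.
    """
    total_bits = n * d * bits_q + n * rank * bits_l + rank * d * bits_r
    return total_bits / (n * d)


# Hypothetical layer: 4096 x 4096, rank 128, 2-bit Q, 4-bit L and R.
example = avg_bits_per_param(4096, 4096, 128, bits_q=2, bits_l=4, bits_r=4)
print(example)  # 2.25
```

Under these assumed settings the low-rank factors add only 0.25 bits per parameter on top of the 2-bit Q, which shows how the rank trades off against the overall budget.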

Abstract

The prohibitive sizes of Large Language Models (LLMs) today make it difficult to deploy them on memory-constrained edge devices. This work introduces CALDERA -- a new post-training LLM compression algorithm that harnesses the inherent low-rank structure of a weight matrix $W$ by approximating it via a low-rank, low-precision decomposition as $W \approx Q + LR$. Here, $L$ and $R$ are low-rank factors, and the entries of $Q$, $L$ and $R$ are quantized. The model is compressed by substituting each layer with its $Q + LR$ decomposition, and the zero-shot performance of the compressed model is evaluated. Additionally, $L$ and $R$ are readily amenable to low-rank adaptation, consequently enhancing the zero-shot performance. CALDERA obtains this…
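The decomposition $W \approx Q + LR$ described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' algorithm: the uniform round-to-nearest quantizer, the plain truncated SVD, the alternating update order, and the names `quantize_uniform` and `low_rank_low_precision` are all assumptions chosen for illustration.

```python
import numpy as np


def quantize_uniform(x, bits):
    """Round-to-nearest uniform quantizer over the range of x (assumed, illustrative)."""
    levels = 2 ** bits - 1
    lo, hi = float(x.min()), float(x.max())
    if hi == lo:
        return x.copy()
    scale = (hi - lo) / levels
    return np.round((x - lo) / scale) * scale + lo


def low_rank_low_precision(W, rank, bits_q=2, bits_lr=4, iters=5):
    """Alternate between quantizing the residual and refitting low-rank factors."""
    L = np.zeros((W.shape[0], rank))
    R = np.zeros((rank, W.shape[1]))
    for _ in range(iters):
        # Quantize what the current low-rank term does not explain.
        Q = quantize_uniform(W - L @ R, bits_q)
        # Fit a rank-`rank` approximation to the quantization residual.
        U, s, Vt = np.linalg.svd(W - Q, full_matrices=False)
        root = np.sqrt(s[:rank])
        # Split singular values evenly between the factors, then quantize them.
        L = quantize_uniform(U[:, :rank] * root, bits_lr)
        R = quantize_uniform(root[:, None] * Vt[:rank], bits_lr)
    return Q, L, R


rng = np.random.default_rng(0)
W = rng.standard_normal((64, 48))  # stand-in for one weight matrix
Q, L, R = low_rank_low_precision(W, rank=8)
rel_err = np.linalg.norm(W - (Q + L @ R)) / np.linalg.norm(W)
print(f"relative Frobenius error: {rel_err:.3f}")
```

Because $L$ and $R$ remain explicit factors in this form, they are the natural place to attach a low-rank adaptation step, which is the property the abstract points to.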


Code & Models

Repositories

pilancilab/caldera (PyTorch, official)

Videos

Compressing Large Language Models using Low Rank and Low Precision Decomposition · SlidesLive

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Speech Recognition and Synthesis