Compressing Large Language Models using Low Rank and Low Precision Decomposition
Rajarshi Saha, Naomi Sagan, Varun Srivastava, Andrea J. Goldsmith, Mert Pilanci

TL;DR
This paper presents CALDERA, a novel post-training compression method for large language models that combines low-rank and low-precision decomposition to significantly reduce model size while maintaining performance.
Contribution
CALDERA introduces a new low-rank, low-precision decomposition technique for LLM compression, with theoretical error bounds and improved performance over existing methods.
Findings
Outperforms existing compression techniques at less than 2.5 bits per parameter.
Effectively compresses LLaMa-2 and LLaMa-3 models with minimal performance loss.
Provides theoretical bounds on approximation error and tradeoffs between compression ratio and accuracy.
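To make the "bits per parameter" figure concrete, here is a minimal sketch of the storage arithmetic. It assumes a compressed layer stores one full-size quantized matrix plus two quantized low-rank factors, as the abstract describes. The function name, the layer shape, the rank, and the bit widths are all hypothetical choices for illustration, and the count ignores any overhead for codebooks or scale factors.

```python
def average_bits_per_parameter(n, d, rank, bits_q, bits_lr):
    """Average storage per original weight for an n x d layer.

    Assumes (hypothetically) a full-size matrix stored at `bits_q` bits per
    entry plus two factors of shapes n x rank and rank x d stored at
    `bits_lr` bits per entry. Codebook and scale overhead is ignored.
    """
    total_bits = bits_q * n * d + bits_lr * rank * (n + d)
    return total_bits / (n * d)


# Hypothetical setting: a 4096 x 4096 layer, rank 128, 2-bit backbone, 4-bit factors.
print(average_bits_per_parameter(4096, 4096, 128, bits_q=2, bits_lr=4))  # prints 2.25
```

Under these assumptions the low-rank factors add only a fraction of a bit per weight, which is why a higher rank or higher factor precision can be traded against the total budget.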
Abstract
The prohibitive sizes of Large Language Models (LLMs) today make it difficult to deploy them on memory-constrained edge devices. This work introduces CALDERA, a new post-training LLM compression algorithm that harnesses the inherent low-rank structure of a weight matrix $W$ by approximating it via a low-rank, low-precision decomposition as $W \approx Q + LR$. Here, $L$ and $R$ are low-rank factors, and the entries of $Q$, $L$, and $R$ are quantized. The model is compressed by substituting each layer with its $Q + LR$ decomposition, and the zero-shot performance of the compressed model is evaluated. Additionally, $L$ and $R$ are readily amenable to low-rank adaptation, consequently enhancing the zero-shot performance. CALDERA obtains this…
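The idea of a low-rank, low-precision decomposition can be illustrated with a small sketch. This is not the paper's algorithm: the abstract above is truncated before it describes how the approximation is obtained, so the code below substitutes a plain round-to-nearest uniform quantizer and a truncated SVD inside a simple alternating loop. The names `quantize_uniform` and `lowrank_lowprecision`, the bit widths, and the iteration count are all assumptions made for illustration.

```python
import numpy as np


def quantize_uniform(a, bits):
    """Round each entry of `a` to one of 2**bits evenly spaced levels over its range.

    A stand-in quantizer, chosen only for simplicity.
    """
    levels = 2 ** bits - 1
    lo, hi = float(a.min()), float(a.max())
    if hi == lo:
        return a.copy()
    step = (hi - lo) / levels
    return lo + np.round((a - lo) / step) * step


def lowrank_lowprecision(w, rank, bits_q=2, bits_lr=4, iters=10):
    """Alternately refit a quantized backbone and quantized rank-`rank` factors.

    Returns (relative_error, q, l, r) for the best iterate seen, where
    q + l @ r approximates w.
    """
    l = np.zeros((w.shape[0], rank))
    r = np.zeros((rank, w.shape[1]))
    best = None
    for _ in range(iters):
        # Quantize whatever the current low-rank term does not explain.
        q = quantize_uniform(w - l @ r, bits_q)
        # Fit a rank-limited correction to the quantization residual.
        u, s, vt = np.linalg.svd(w - q, full_matrices=False)
        root = np.sqrt(s[:rank])
        l = quantize_uniform(u[:, :rank] * root, bits_lr)
        r = quantize_uniform(root[:, None] * vt[:rank], bits_lr)
        err = np.linalg.norm(w - (q + l @ r)) / np.linalg.norm(w)
        if best is None or err < best[0]:
            best = (err, q, l, r)
    return best


# Usage on a random matrix (synthetic data, not a real model layer).
rng = np.random.default_rng(0)
w_demo = rng.standard_normal((64, 48))
err_demo, q_demo, l_demo, r_demo = lowrank_lowprecision(w_demo, rank=8)
print(f"relative Frobenius error: {err_demo:.3f}")
```

The loop keeps the best iterate because, with quantized factors, this naive alternation is not guaranteed to decrease the error at every step.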
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Speech Recognition and Synthesis
