DeltaLLM: Compress LLMs with Low-Rank Deltas between Shared Weights

Liana Mikaelyan; Ayyoob Imani; Mathew Salvaris; Parth Pathak; Mohsen Fayyaz

arXiv:2501.18596·cs.LG·February 25, 2025

TL;DR

DeltaLLM is a post-training compression method for large language models that shares weights between layers in subsequent Transformer blocks and stores low-rank difference matrices between them, reducing parameters by about 12% while retaining about 90% of the base models' performance.

Contribution

The paper presents a new structure for LLMs with shared weights and low-rank differences, enabling effective compression with minimal fine-tuning and outperforming existing methods.

Findings

1. Achieves 12% parameter reduction with 90% performance retention.
2. Outperforms existing compression techniques like JointDrop and LaCo.
3. Models like DeltaPhi 2.9B match larger models' accuracy with fewer parameters.

Abstract

We introduce DeltaLLM, a new post-training compression technique to reduce the memory footprint of LLMs. We propose an alternative way of structuring LLMs with weight sharing between layers in subsequent Transformer blocks, along with additional low-rank difference matrices between them. For training, we adopt the progressing module replacement method and show that the lightweight training of the low-rank modules with approximately 30M-40M tokens is sufficient to achieve performance on par with LLMs of comparable sizes trained from scratch. We release the resultant models, DeltaLLAMA and DeltaPHI, with a 12% parameter reduction, retaining 90% of the performance of the base Llama and Phi models on common knowledge and reasoning benchmarks. Our method also outperforms compression techniques JointDrop, LaCo, ShortGPT and SliceGPT with the same number of parameters removed. For example,…
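To make the structure described in the abstract concrete, here is a minimal sketch of one full weight matrix shared across several layers, with each later layer adding its own low-rank difference. It is an illustration under stated assumptions, not the authors' implementation: the class `DeltaLayer`, the helper `count_params`, the factorization into `U @ V`, and the toy sizes (8x8 weights, rank 2) are all hypothetical, and this excerpt does not say which layers are shared, what rank is used, or how the progressive module replacement training works.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add(a, b):
    """Element-wise sum of two matrices of the same shape."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def rand_matrix(rows, cols, rng, scale=0.1):
    """Random matrix used as a stand-in for trained weights."""
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


class DeltaLayer:
    """Hypothetical layer whose effective weight is W_shared + U @ V."""

    def __init__(self, shared, rank, rng):
        d_out, d_in = len(shared), len(shared[0])
        self.shared = shared                    # reference to the shared matrix, not a copy
        self.u = rand_matrix(d_out, rank, rng)  # low-rank factor, d_out x rank
        self.v = rand_matrix(rank, d_in, rng)   # low-rank factor, rank x d_in

    def weight(self):
        """Materialize the effective weight for this layer."""
        return add(self.shared, matmul(self.u, self.v))

    def extra_params(self):
        """Parameters this layer stores beyond the shared matrix."""
        return len(self.u) * len(self.u[0]) + len(self.v) * len(self.v[0])


def count_params(d_out, d_in, rank, n_layers):
    """Compare n_layers dense matrices with one shared matrix plus
    (n_layers - 1) low-rank deltas (an assumed sharing pattern)."""
    dense = n_layers * d_out * d_in
    compressed = d_out * d_in + (n_layers - 1) * rank * (d_out + d_in)
    return dense, compressed


rng = random.Random(0)
shared = rand_matrix(8, 8, rng)  # the one full matrix, kept by the first layer
layers = [DeltaLayer(shared, rank=2, rng=rng) for _ in range(3)]  # three later layers

dense, compressed = count_params(d_out=8, d_in=8, rank=2, n_layers=4)
print(dense, compressed)  # 256 vs 160 for these toy sizes
```

The saving comes from the term `rank * (d_out + d_in)`, which is much smaller than `d_out * d_in` when the rank is small relative to the matrix dimensions. The toy ratio here says nothing about the 12% figure reported in the abstract, which depends on choices not given in this excerpt.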

Taxonomy

Topics: Neural Networks and Applications · Algorithms and Data Compression

Methods: Attention Is All You Need · Linear Layer · Dense Connections · ADaptive gradient method with the OPTimal convergence rate · Multi-Head Attention · Position-Wise Feed-Forward Layer · Label Smoothing · Layer Normalization · Softmax · Balanced Selection