DeltaLLM: Compress LLMs with Low-Rank Deltas between Shared Weights
Liana Mikaelyan, Ayyoob Imani, Mathew Salvaris, Parth Pathak, Mohsen Fayyaz

TL;DR
DeltaLLM is a post-training compression method for large language models that shares weights between Transformer blocks and stores only low-rank difference matrices between them, cutting parameters by about 12% while retaining roughly 90% of base-model performance.
Contribution
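The paper presents a new structure for LLMs in which layers in subsequent Transformer blocks share weights and differ only by low-rank matrices. Only the low-rank modules need lightweight training, on approximately 30M-40M tokens, and the resulting models outperform existing compression methods at the same number of parameters removed.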
Findings
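DeltaLLAMA and DeltaPHI achieve a 12% parameter reduction while retaining 90% of the base models' performance on common knowledge and reasoning benchmarks.
Outperforms the compression techniques JointDrop, LaCo, ShortGPT and SliceGPT with the same number of parameters removed.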
Models like DeltaPhi 2.9B match larger models' accuracy with fewer parameters.
Abstract
We introduce DeltaLLM, a new post-training compression technique to reduce the memory footprint of LLMs. We propose an alternative way of structuring LLMs with weight sharing between layers in subsequent Transformer blocks, along with additional low-rank difference matrices between them. For training, we adopt the progressing module replacement method and show that the lightweight training of the low-rank modules with approximately 30M-40M tokens is sufficient to achieve performance on par with LLMs of comparable sizes trained from scratch. We release the resultant models, DeltaLLAMA and DeltaPHI, with a 12% parameter reduction, retaining 90% of the performance of the base Llama and Phi models on common knowledge and reasoning benchmarks. Our method also outperforms compression techniques JointDrop, LaCo, ShortGPT and SliceGPT with the same number of parameters removed. For example,…
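The structure the abstract describes, one stored weight matrix reused across blocks plus a small low-rank difference per block, can be sketched as follows. This is a minimal illustration in plain Python, not the authors' implementation: the names `DeltaLinear`, `rand_matrix` and `param_counts` are hypothetical, the factorization of the difference as `A @ B` is an assumed form of "low-rank difference matrix", and the zero initialization of `B` is a convention borrowed from LoRA rather than something the abstract states.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add(a, b):
    """Element-wise sum of two matrices of the same shape."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def rand_matrix(rows, cols, rng, scale=0.1):
    """Small random matrix; a stand-in for trained weights (hypothetical helper)."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


class DeltaLinear:
    """Hypothetical layer whose effective weight is shared + A @ B."""

    def __init__(self, shared, rank, rng):
        d_out, d_in = len(shared), len(shared[0])
        self.shared = shared  # a reference, not a copy: stored once for all blocks
        self.a = rand_matrix(d_out, rank, rng)
        # Assumption (LoRA-style): B starts at zero, so the delta starts at zero.
        self.b = [[0.0] * d_in for _ in range(rank)]

    def effective_weight(self):
        return add(self.shared, matmul(self.a, self.b))

    def forward(self, x):
        """Apply the layer to a single input vector x (a list of floats)."""
        return [sum(w * v for w, v in zip(row, x)) for row in self.effective_weight()]

    def num_own_params(self):
        """Parameters this layer stores beyond the shared matrix."""
        return len(self.a) * len(self.a[0]) + len(self.b) * len(self.b[0])


def param_counts(d_out, d_in, rank, n_layers):
    """Compare n independent layers with one shared layer plus n-1 low-rank deltas."""
    independent = n_layers * d_out * d_in
    shared_plus_deltas = d_out * d_in + (n_layers - 1) * rank * (d_out + d_in)
    return independent, shared_plus_deltas


rng = random.Random(0)
shared = rand_matrix(4, 4, rng)
layer = DeltaLinear(shared, rank=2, rng=rng)
print(layer.num_own_params())
print(param_counts(8, 8, 2, 4))
```

The saving comes from the rank being much smaller than the layer dimensions: each additional layer costs `rank * (d_out + d_in)` parameters instead of `d_out * d_in`. The toy dimensions here are chosen for readability and say nothing about the ranks or layer choices used in the paper.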
Taxonomy
Topics: Neural Networks and Applications · Algorithms and Data Compression
Methods: Attention Is All You Need · Linear Layer · Dense Connections · ADaptive gradient method with the OPTimal convergence rate · Multi-Head Attention · Position-Wise Feed-Forward Layer · Label Smoothing · Layer Normalization · Softmax · Balanced Selection
