Lillama: Large Language Models Compression via Low-Rank Feature Distillation
Yaya Sy, Christophe Cerisara, Irina Illina

TL;DR
Lillama introduces a low-rank activation distillation method for compressing large language models efficiently, achieving high compression ratios with minimal performance loss and faster convergence.
Contribution
The paper proposes a novel low-rank feature distillation approach that accelerates LLM compression, reducing memory and time requirements while maintaining high accuracy.
Findings
Removes 10 billion parameters from Mixtral-8x7B within minutes on a single A100 GPU.
Retains over 95% of original performance after compression.
Generalizes to non-transformer architectures like Mamba-3B.
Abstract
Current LLM structured pruning methods typically involve two steps: (1) compression with calibration data and (2) costly continued pretraining on billions of tokens to recover lost performance. This second step is necessary as the first significantly impacts model accuracy. Prior research suggests pretrained Transformer weights aren't inherently low-rank, unlike their activations, which may explain this drop. Based on this observation, we propose Lillama, a compression method that locally distills activations with low-rank weights. Using SVD for initialization and a joint loss combining teacher and student activations, we accelerate convergence and reduce memory use with local gradient updates. Lillama compresses Mixtral-8x7B within minutes on a single A100 GPU, removing 10 billion parameters while retaining over 95% of its original performance. Phi-2 3B can be compressed by 40% with…
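The core idea in the abstract (initialize low-rank weights with SVD, then distill activations locally, one layer at a time) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names `svd_init` and `local_distill`, the toy dimensions, the synthetic low-rank activations, and the plain gradient-descent loop are all assumptions made for illustration. In particular, the sketch uses only a mean-squared error against teacher activations; the joint teacher/student loss mentioned in the abstract is not reproduced here.

```python
import numpy as np

def svd_init(W, rank):
    """Factor W (d_out x d_in) into B (d_out x rank) @ A (rank x d_in) by truncated SVD."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    sqrt_s = np.sqrt(S[:rank])
    B = U[:, :rank] * sqrt_s           # scale the kept left singular vectors
    A = sqrt_s[:, None] * Vt[:rank]    # scale the kept right singular vectors
    return B, A

def local_distill(W, X, rank, steps=200, lr=0.01):
    """Fit B @ A so that its output on X matches the teacher activations W @ X.

    Only this layer's factors are updated, so no gradient flows through
    the rest of the network (a stand-in for "local gradient updates").
    """
    B, A = svd_init(W, rank)
    Y = W @ X                          # teacher activations for this layer
    losses = []
    for _ in range(steps):
        H = A @ X                      # hidden low-rank features
        E = B @ H - Y                  # student minus teacher activations
        losses.append(float(np.mean(E ** 2)))
        scale = 2.0 / E.size           # derivative of the mean squared error
        grad_B = scale * (E @ H.T)
        grad_A = scale * (B.T @ E @ X.T)
        B = B - lr * grad_B
        A = A - lr * grad_A
    losses.append(float(np.mean((B @ A @ X - Y) ** 2)))
    return B, A, losses

# Toy setup (hypothetical sizes): a dense weight and inputs that span only 4 directions.
rng = np.random.default_rng(0)
d_out, d_in, n, rank = 16, 32, 256, 8
W = rng.standard_normal((d_out, d_in)) / np.sqrt(d_in)
X = (rng.standard_normal((d_in, 4)) / 2.0) @ rng.standard_normal((4, n))

B, A, losses = local_distill(W, X, rank)
print("parameters:", W.size, "->", B.size + A.size)
print("activation MSE before/after distillation:", losses[0], losses[-1])
```

The factorized layer stores rank * (d_in + d_out) parameters instead of d_in * d_out, which is where the compression comes from. Because the toy activations are low-rank even though the weight is not, the distillation step can reduce the activation error that plain SVD truncation of the weight leaves behind.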
Taxonomy
Topics: Speech Recognition and Synthesis · Text and Document Classification Technologies · Algorithms and Data Compression
Methods: Attention Is All You Need · Linear Layer · Byte Pair Encoding · Absolute Position Encodings · Dense Connections · Multi-Head Attention · Position-Wise Feed-Forward Layer · Label Smoothing · Residual Connection · Adam
