Lillama: Large Language Models Compression via Low-Rank Feature Distillation

Yaya Sy; Christophe Cerisara; Irina Illina

arXiv:2412.16719·cs.LG·December 31, 2024

TL;DR

Lillama introduces a low-rank activation distillation method for compressing large language models efficiently, achieving high compression ratios with minimal performance loss and faster convergence.

Contribution

The paper proposes a low-rank feature distillation approach that locally distills teacher activations into low-rank student weights. SVD initialization and local gradient updates speed up convergence and reduce memory use during compression.

Findings

01 Removes 10 billion parameters from Mixtral-8x7B within minutes on a single A100 GPU.

02 The compressed Mixtral-8x7B retains over 95% of its original performance.

03 Generalizes to non-transformer architectures such as Mamba-3B.

Abstract

Current LLM structured pruning methods typically involve two steps: (1) compression with calibration data and (2) costly continued pretraining on billions of tokens to recover lost performance. This second step is necessary as the first significantly impacts model accuracy. Prior research suggests pretrained Transformer weights aren't inherently low-rank, unlike their activations, which may explain this drop. Based on this observation, we propose Lillama, a compression method that locally distills activations with low-rank weights. Using SVD for initialization and a joint loss combining teacher and student activations, we accelerate convergence and reduce memory use with local gradient updates. Lillama compresses Mixtral-8x7B within minutes on a single A100 GPU, removing 10 billion parameters while retaining over 95% of its original performance. Phi-2 3B can be compressed by 40% with…
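
The idea in the abstract (initialize low-rank factors with SVD, then update them locally so the student's activations match the teacher's) can be illustrated with a minimal NumPy sketch on a single linear layer. This is not the authors' implementation: the function names `svd_init` and `local_distill`, the toy dimensions, and the synthetic data are all hypothetical, and a plain mean squared error on output activations stands in for the paper's joint loss, whose exact form is not given here. The inputs are constructed to lie in a low-dimensional subspace, mirroring the observation that activations tend to be low-rank even when weights are not.

```python
import numpy as np

def svd_init(W, rank):
    """Truncated-SVD factors A (d_out x r) and B (r x d_in) with A @ B ~ W."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    sqrt_s = np.sqrt(S[:rank])
    A = U[:, :rank] * sqrt_s           # scale the kept left singular vectors
    B = sqrt_s[:, None] * Vt[:rank]    # scale the kept right singular vectors
    return A, B

def local_distill(W, X, rank, lr=0.05, steps=500):
    """Fit low-rank factors so student activations X @ (A @ B).T match
    teacher activations X @ W.T (plain MSE; an assumption, not the paper's loss)."""
    A, B = svd_init(W, rank)
    Y_teacher = X @ W.T
    losses = []
    for _ in range(steps):
        H = X @ B.T                    # low-rank hidden features, shape (n, r)
        E = H @ A.T - Y_teacher        # activation error, shape (n, d_out)
        losses.append(float(np.mean(E ** 2)))
        scale = 2.0 / E.size           # derivative of the mean of squares
        grad_A = scale * E.T @ H       # shape (d_out, r)
        grad_B = scale * A.T @ E.T @ X # shape (r, d_in)
        A -= lr * grad_A               # gradients touch only this layer's factors
        B -= lr * grad_B
    return A, B, losses

# Toy setup (hypothetical sizes): dense random weights, low-rank inputs.
rng = np.random.default_rng(0)
d_in, d_out, k, n = 32, 64, 8, 256
W = rng.standard_normal((d_out, d_in)) / np.sqrt(d_in)
X = rng.standard_normal((n, k)) @ rng.standard_normal((k, d_in)) / np.sqrt(k)

A, B, losses = local_distill(W, X, rank=k)
print("parameters: full =", W.size, " low-rank =", A.size + B.size)
print("activation MSE: after SVD init =", losses[0], " after distillation =", losses[-1])
```

Because the update uses only this layer's inputs and outputs, no gradients flow through the rest of the network, which is the sense in which the sketch is "local". The factorization stores `rank * (d_in + d_out)` values instead of `d_in * d_out`.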

Taxonomy

Topics: Speech Recognition and Synthesis · Text and Document Classification Technologies · Algorithms and Data Compression

Methods: Attention Is All You Need · Linear Layer · Byte Pair Encoding · Absolute Position Encodings · Dense Connections · Multi-Head Attention · Position-Wise Feed-Forward Layer · Label Smoothing · Residual Connection · Adam