SDMPrune: Self-Distillation MLP Pruning for Efficient Large Language Models

Hourun Zhu; Chengchao Shen

arXiv:2506.11120 · cs.CL · June 16, 2025

TL;DR

SDMPrune introduces a self-distillation-based pruning method that focuses on the MLP modules of large language models, significantly reducing parameter count while largely maintaining performance and outperforming existing pruning techniques.

Contribution

The paper applies a self-distillation loss during the pruning phase (rather than in post-training) and targets MLP modules specifically, compressing large language models with minimal performance loss.

Findings

01. Outperforms existing pruning methods on zero-shot benchmarks.

02. Achieves significant parameter reduction in LLMs with minimal performance degradation.

03. Achieves competitive results among 1B-scale open-source LLMs.

Abstract

Despite the strong performance achieved by LLMs, the costs of their deployment are prohibitive. For the compression of LLMs, gradient-based pruning methods have shown promising effectiveness. However, in these methods, computing gradients with one-hot labels ignores the model's potential predictions for other words, and thus misses key information about the generative capability of the original model. To address this issue, we introduce a self-distillation loss during the pruning phase (rather than post-training) to fully exploit the predictions of the original model, thereby obtaining more accurate gradient information for pruning. Moreover, we find that, compared to attention modules, the predictions of LLMs are less sensitive to multilayer perceptron (MLP) modules, which take up more than $5\times$ as many parameters (LLaMA3.2-1.2B). To this end, we focus on the pruning of MLP modules, to significantly…
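The contrast the abstract draws between one-hot labels and the original model's predictions can be illustrated with a small sketch. This is not the authors' implementation: it assumes a forward KL divergence between teacher and student distributions at temperature 1, and it only computes gradients with respect to the output logits of a single token position, whereas a gradient-based pruning method would propagate such gradients back to the weights to score them. The function names and the toy logits are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def grad_one_hot_ce(student_logits, label):
    """Gradient of cross-entropy with a one-hot label w.r.t. the logits: p - y."""
    p = softmax(student_logits)
    return [p_i - (1.0 if i == label else 0.0) for i, p_i in enumerate(p)]

def grad_self_distill_kl(student_logits, teacher_logits):
    """Gradient of KL(teacher || student) w.r.t. the student logits: p - q.

    Assumption: forward KL at temperature 1; the abstract does not state
    the exact form of the self-distillation loss.
    """
    p = softmax(student_logits)
    q = softmax(teacher_logits)
    return [p_i - q_i for p_i, q_i in zip(p, q)]

# Toy 4-word vocabulary (hypothetical values). The teacher is the original
# model; the student stands in for the model being pruned.
teacher_logits = [2.0, 1.5, -1.0, -3.0]
student_logits = [1.0, 2.0, -1.0, -3.0]
label = 0  # ground-truth next word

g_ce = grad_one_hot_ce(student_logits, label)
g_kd = grad_self_distill_kl(student_logits, teacher_logits)

print("one-hot CE gradient:       ", [round(g, 4) for g in g_ce])
print("self-distillation gradient:", [round(g, 4) for g in g_kd])
```

With the one-hot target, every word other than the label is pushed down by its full predicted probability, regardless of how plausible the original model considered it. With the teacher distribution as target, the gradient for each word depends on the gap between the student's and the teacher's probabilities, so the original model's preferences over alternative words enter the signal used to score weights.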

Code & Models

Repositories

visresearch/SDMPrune (PyTorch, official)

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Speech and dialogue systems

Methods: Softmax · Attention Is All You Need · Focus · Pruning