Improved Methods for Model Pruning and Knowledge Distillation
Wei Jiang, Anying Fu, Youling Zhang

TL;DR
This paper introduces MAMA Pruning, a novel method for reducing large language models' size and complexity while maintaining performance, using weight, bias, and reward-based indicators during pruning.
Contribution
The paper presents MAMA Pruning, an improved pruning technique that effectively reduces model size and computational load with minimal performance loss, outperforming existing methods.
Findings
MAMA Pruning maintains performance at high pruning levels.
It outperforms state-of-the-art pruning methods.
It is effective across various NLP tasks.
Abstract
Model pruning is a performance optimization technique for large language models such as R1 or o3-mini. It aims to identify and remove neurons and connections that are unlikely to contribute during the human-computer interaction phase. However, existing pruning methods often cause significant performance degradation or require extensive retraining and fine-tuning. Our goal is to obtain a much smaller and faster knowledge-distilled model that can quickly generate content almost as good as that of the unpruned model. We propose MAMA Pruning, short for Movement and Magnitude Analysis, an improved pruning method that effectively reduces model size and computational complexity while maintaining performance comparable to the original unpruned model, even at extreme pruning levels. The improved method is based on weights and biases fixed in the pre-training phase and GRPO rewards verified…
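To make the idea of combining movement and magnitude indicators concrete, here is a minimal sketch in plain Python. It is not the authors' algorithm: the abstract is truncated and does not give the scoring rule, so the first-order movement score (`-w * grad`), the min-max normalization, the mixing weight `alpha`, and all function names below are assumptions chosen for illustration. The bias and GRPO reward indicators mentioned in the abstract are left out because the source does not say how they enter the score.

```python
from typing import List


def magnitude_scores(weights: List[float]) -> List[float]:
    # Magnitude indicator: weights with larger absolute value are treated as more important.
    return [abs(w) for w in weights]


def movement_scores(weights: List[float], grads: List[float]) -> List[float]:
    # First-order movement indicator (an assumption, not taken from the paper):
    # -w * dL/dw is positive when a gradient-descent step moves w away from zero.
    return [-w * g for w, g in zip(weights, grads)]


def _normalize(xs: List[float]) -> List[float]:
    # Min-max scale to [0, 1] so the two indicators are comparable.
    lo, hi = min(xs), max(xs)
    if hi == lo:
        return [0.0] * len(xs)
    return [(x - lo) / (hi - lo) for x in xs]


def prune_mask(weights: List[float], grads: List[float],
               sparsity: float, alpha: float = 0.5) -> List[int]:
    # Returns a 0/1 mask that drops the `sparsity` fraction of weights
    # with the lowest combined score. `alpha` is a hypothetical mixing weight.
    if not 0.0 <= sparsity <= 1.0:
        raise ValueError("sparsity must be in [0, 1]")
    mag = _normalize(magnitude_scores(weights))
    mov = _normalize(movement_scores(weights, grads))
    combined = [alpha * m + (1.0 - alpha) * v for m, v in zip(mag, mov)]
    n_prune = int(sparsity * len(weights))
    order = sorted(range(len(weights)), key=lambda i: combined[i])
    pruned = set(order[:n_prune])
    return [0 if i in pruned else 1 for i in range(len(weights))]


def apply_mask(weights: List[float], mask: List[int]) -> List[float]:
    # Zero out the pruned weights; kept weights are unchanged.
    return [w * m for w, m in zip(weights, mask)]


if __name__ == "__main__":
    w = [0.9, -0.05, 0.4, 0.01]
    g = [-0.1, 0.2, 0.0, 0.0]
    mask = prune_mask(w, g, sparsity=0.5)
    print(mask, apply_mask(w, mask))
```

With `alpha=1.0` the sketch reduces to plain magnitude pruning, and with `alpha=0.0` it ranks weights by the movement indicator alone; intermediate values trade off the two.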
Taxonomy
Topics: Advanced Computational Techniques and Applications
Methods: Pruning
