LLM Pruning and Distillation in Practice: The Minitron Approach
Sharath Turuvekere Sreenivas, Saurav Muralidharan, Raviraj Joshi, Marcin Chochowski, Ameya Sunil Mahabaleshwarkar, Gerald Shen, Jiaqi Zeng, Zijia Chen, Yoshi Suhara, Shizhe Diao, Chenhan Yu, Wei-Chun Chen, Hayley Ross, Oluwatobi Olabiyi, Ashwath Aithal, Oleksii Kuchaiev

TL;DR
This paper shows how pruning and distillation compress Llama 3.1 8B and Mistral NeMo 12B to 4B and 8B parameters, respectively, yielding a compelling 4B model and a state-of-the-art 8B model, and it releases the base model weights openly.
Contribution
It presents a practical pruning-and-distillation recipe for large language models, compares two pruning strategies (depth pruning and joint hidden/attention/MLP width pruning), and open-sources the compressed base models.
Findings
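Pruning and distillation yield a compelling 4B model from Llama 3.1 8B and a state-of-the-art 8B model from Mistral NeMo 12B.
When the original training data is unavailable, slightly fine-tuning the teacher model on the distillation dataset improves distillation results.
The base model weights are released on Hugging Face under a permissive license, which facilitates further research.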
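As a rough illustration of what "distillation" means here, the sketch below computes a forward KL divergence between a teacher's and a student's next-token distributions for a single position. This page does not state the exact loss the authors use, so treat the choice of forward KL, the `temperature` parameter, and the function names `log_softmax` and `kd_loss` as assumptions made for illustration only.

```python
import math
from typing import List, Sequence


def log_softmax(logits: Sequence[float], temperature: float = 1.0) -> List[float]:
    """Numerically stable log-softmax over one vector of logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    log_z = m + math.log(sum(math.exp(x - m) for x in scaled))
    return [x - log_z for x in scaled]


def kd_loss(
    teacher_logits: Sequence[float],
    student_logits: Sequence[float],
    temperature: float = 1.0,
) -> float:
    """Forward KL(teacher || student) for a single token position.

    Assumption: this is one common distillation objective, not
    necessarily the exact loss used in the paper.
    """
    log_p = log_softmax(teacher_logits, temperature)  # teacher log-probs
    log_q = log_softmax(student_logits, temperature)  # student log-probs
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


# The loss is zero when the student matches the teacher exactly and
# grows as the two distributions diverge.
matched = kd_loss([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
mismatched = kd_loss([1.0, 2.0, 3.0], [3.0, 2.0, 1.0])
```

In a training loop this quantity would be averaged over positions and minimized with respect to the student's parameters. Fine-tuning the teacher on the distillation dataset first, as the finding above suggests, changes only where `teacher_logits` come from.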
Abstract
We present a comprehensive report on compressing the Llama 3.1 8B and Mistral NeMo 12B models to 4B and 8B parameters, respectively, using pruning and distillation. We explore two distinct pruning strategies: (1) depth pruning and (2) joint hidden/attention/MLP (width) pruning, and evaluate the results on common benchmarks from the LM Evaluation Harness. The models are then aligned with NeMo Aligner and tested in instruct-tuned versions. This approach produces a compelling 4B model from Llama 3.1 8B and a state-of-the-art Mistral-NeMo-Minitron-8B (MN-Minitron-8B for brevity) model from Mistral NeMo 12B. We found that with no access to the original data, it is beneficial to slightly fine-tune teacher models on the distillation dataset. We open-source our base model weights on Hugging Face with a permissive license.
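The two pruning strategies named in the abstract can be contrasted with a toy sketch. Depth pruning removes whole layers, while width pruning keeps every layer but shrinks dimensions inside it (hidden size, attention heads, MLP neurons). The abstract does not describe how layers or neurons are selected, so the contiguous-block choice, the `importance` scores, and the helper names `depth_prune` and `width_prune` below are hypothetical stand-ins, not the authors' procedure.

```python
from typing import List, Sequence


def depth_prune(layers: Sequence[str], drop_start: int, drop_count: int) -> List[str]:
    """Depth pruning: drop a contiguous run of whole layers.

    Assumption: which layers to drop is given; the selection
    criterion is not specified on this page.
    """
    return list(layers[:drop_start]) + list(layers[drop_start + drop_count:])


def width_prune(
    weight: Sequence[Sequence[float]],
    importance: Sequence[float],
    keep: int,
) -> List[List[float]]:
    """Width pruning: keep the `keep` highest-scoring rows of a weight matrix.

    Each row stands for one unit (for example an MLP neuron). The
    `importance` scores are hypothetical inputs to this sketch.
    """
    ranked = sorted(range(len(importance)), key=lambda i: importance[i], reverse=True)
    kept = sorted(ranked[:keep])  # preserve the original row order
    return [list(weight[i]) for i in kept]


# Depth: a 4-layer stack loses layers 1 and 2.
shallow = depth_prune(["layer0", "layer1", "layer2", "layer3"], drop_start=1, drop_count=2)

# Width: a 4-row matrix keeps its 2 most important rows.
narrow = width_prune(
    [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0], [4.0, 4.0]],
    importance=[0.1, 0.9, 0.5, 0.7],
    keep=2,
)
```

Either way the pruned model has fewer parameters than the original, and distillation from the unpruned teacher is then used to recover accuracy.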
Code & Models
- 🤗 nvidia/Llama-3.1-Minitron-4B-Width-Base · model · 2.9k dl · ♡ 193
- 🤗 nvidia/Llama-3.1-Minitron-4B-Depth-Base · model · 1.0k dl · ♡ 21
- 🤗 nvidia/Mistral-NeMo-Minitron-8B-Base · model · 3.0k dl · ♡ 177
- 🤗 RichardErkhov/nvidia_-_Mistral-NeMo-Minitron-8B-Base-gguf · model · 88 dl · ♡ 1
- 🤗 denkijin/Llama-3.1-Minitron-4B-Width-Base · model · 2 dl
- 🤗 QuantFactory/Llama-3.1-Minitron-4B-Width-Base-GGUF · model · 8 dl · ♡ 1
- 🤗 TitanML/Mistral-NeMo-Minitron-8B-Base · model · 4 dl
- 🤗 mylesgoose/Llama-3.1-Minitron-4B-Width-Base · model · 21 dl · ♡ 1
- 🤗 da-fr/Mistral-NeMo-Minitron-8B-ARChitects-Full-bnb-4bit · model · 2 dl · ♡ 7
- 🤗 RichardErkhov/nvidia_-_Llama-3.1-Minitron-4B-Width-Base-4bits · model · 1 dl
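The released checkpoints listed above can typically be loaded with the Hugging Face `transformers` library. The snippet below is a minimal sketch: it assumes `transformers` and `torch` are installed in a version recent enough to support the model architecture, and that the machine has network access and enough memory. The helper name `load_base_model` is hypothetical. Nothing is downloaded until the helper is called.

```python
# Repository ID taken from the model list above.
MODEL_ID = "nvidia/Llama-3.1-Minitron-4B-Width-Base"


def load_base_model(model_id: str = MODEL_ID):
    """Download and return (tokenizer, model) for a released base checkpoint.

    Assumption: `transformers` and `torch` are installed; the import is
    deferred so that defining this helper has no side effects.
    """
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    # torch_dtype="auto" uses the dtype stored in the checkpoint config.
    model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")
    return tokenizer, model
```

Calling `tokenizer, model = load_base_model()` fetches the weights on first use. These are base models rather than instruct-tuned ones, so they are suited to plain text completion.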
Taxonomy
Topics: Metallurgical Processes and Thermodynamics · Mineral Processing and Grinding · Extraction and Separation Processes
Methods: LLaMA · Pruning · Balanced Selection
