Enabling High-Sparsity Foundational Llama Models with Efficient Pretraining and Deployment
Abhinav Agarwalla, Abhay Gupta, Alexandre Marques, Shubhra Pandit, Michael Goin, Eldar Kurtic, Kevin Leong, Tuan Nguyen, Mahmoud Salem, Dan Alistarh, Sean Lie, Mark Kurtz

TL;DR
This paper presents a method for creating sparse LLaMA-2 7B models, at up to 70% sparsity, that recover full accuracy on fine-tuning tasks. Sparsity speeds up training on Cerebras CS-3 chips and inference on CPUs and GPUs, and the models retain performance across multiple NLP tasks.
Contribution
The authors introduce a novel sparse pretraining and pruning approach for LLaMA models that achieves up to 70% sparsity with full accuracy recovery, accelerating training and inference.
Findings
Achieved up to 70% sparsity with full accuracy recovery on fine-tuning tasks.
Realized up to 8.6x total speedup on CPUs with sparse-quantized models.
Demonstrated broad task performance including chat, coding, and reasoning.
Abstract
Large language models (LLMs) have revolutionized Natural Language Processing (NLP), but their size creates computational bottlenecks. We introduce a novel approach to create accurate, sparse foundational versions of performant LLMs that achieve full accuracy recovery for fine-tuning tasks at up to 70% sparsity. We achieve this for the LLaMA-2 7B model by combining the SparseGPT one-shot pruning method and sparse pretraining of those models on a subset of the SlimPajama dataset mixed with a Python subset of The Stack dataset. We exhibit training acceleration due to sparsity on Cerebras CS-3 chips that closely matches theoretical scaling. In addition, we establish inference acceleration of up to 3x on CPUs by utilizing Neural Magic's DeepSparse engine and 1.7x on GPUs through Neural Magic's nm-vllm engine. The above gains are realized via sparsity alone, thus enabling further gains…
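The prune-then-retrain recipe described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation. Simple magnitude pruning stands in for SparseGPT, which selects weights differently, and a plain SGD step stands in for the actual pretraining optimizer. The function names `magnitude_prune`, `masked_sgd_step`, and `sparsity_of` are hypothetical and chosen only for this example. The sketch shows the one property both stages share: once a mask is fixed, every later update must keep the pruned positions at zero.

```python
import numpy as np


def magnitude_prune(weights, sparsity):
    """One-shot unstructured pruning: zero the smallest-magnitude weights.

    ASSUMPTION: magnitude pruning is a stand-in here; the paper uses SparseGPT.
    Returns the pruned weights and a boolean mask (True = weight is kept).
    """
    if not 0.0 <= sparsity < 1.0:
        raise ValueError("sparsity must be in [0, 1)")
    k = int(round(sparsity * weights.size))  # number of weights to remove
    flat_mask = np.ones(weights.size, dtype=bool)
    if k > 0:
        # Indices of the k smallest |w|; argpartition avoids a full sort.
        drop = np.argpartition(np.abs(weights).ravel(), k - 1)[:k]
        flat_mask[drop] = False
    mask = flat_mask.reshape(weights.shape)
    return weights * mask, mask


def masked_sgd_step(weights, grad, mask, lr=0.01):
    """One SGD step that re-applies the mask so pruned weights stay zero.

    ASSUMPTION: plain SGD is used only for illustration.
    """
    return (weights - lr * grad) * mask


def sparsity_of(weights):
    """Fraction of entries that are exactly zero."""
    return float(np.mean(weights == 0))


# Toy example: prune a random 64x64 layer to 70% sparsity, then update it.
rng = np.random.default_rng(0)
w = rng.normal(size=(64, 64))
w_sparse, mask = magnitude_prune(w, 0.7)
g = rng.normal(size=w.shape)  # placeholder gradient
w_next = masked_sgd_step(w_sparse, g, mask)

print(f"sparsity after pruning: {sparsity_of(w_sparse):.3f}")
print(f"sparsity after update:  {sparsity_of(w_next):.3f}")
```

As a rough guide, under the idealized assumption that compute scales with the number of nonzero weights, a sparsity level s bounds the reduction in multiply-accumulate work at 1 / (1 - s), which is about 3.3x at 70% sparsity.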
Code & Models
- 🤗 RedHatAI/Llama-2-7b-pruned50-retrained · model · 12 dl
- 🤗 RedHatAI/Llama-2-7b-pruned70-retrained · model · 80 dl · ♡ 1
- 🤗 RedHatAI/Llama-2-7b-ultrachat200k · model · 100 dl · ♡ 1
- 🤗 RedHatAI/Llama-2-7b-ultrachat200k-pruned_50 · model · 6 dl
- 🤗 RedHatAI/Llama-2-7b-ultrachat200k-pruned_70 · model · 16 dl
- 🤗 RedHatAI/Llama-2-7b-ultrachat200k-pruned_50-quantized-deepsparse · model · 13 dl
- 🤗 RedHatAI/Llama-2-7b-ultrachat200k-pruned_70-quantized-deepsparse · model · 9 dl
- 🤗 RedHatAI/Llama-2-7b-evolcodealpaca · model · 38 dl · ♡ 1
- 🤗 RedHatAI/Llama-2-7b-evol-code-alpaca-pruned_50 · model · 17 dl
- 🤗 RedHatAI/Llama-2-7b-evol-code-alpaca-pruned_70 · model · 17 dl
Taxonomy
Topics: Dynamics and Control of Mechanical Systems · Robotic Mechanisms and Dynamics · Computational Geometry and Mesh Generation
Methods: LLaMA · Pruning
