Enabling High-Sparsity Foundational Llama Models with Efficient Pretraining and Deployment

Abhinav Agarwalla; Abhay Gupta; Alexandre Marques; Shubhra Pandit; Michael Goin; Eldar Kurtic; Kevin Leong; Tuan Nguyen; Mahmoud Salem; Dan Alistarh; Sean Lie; Mark Kurtz

arXiv:2405.03594 · cs.CL · May 7, 2024

Open Access 10 Models

TL;DR

This paper presents a method for creating highly sparse LLaMA-2 7B models that recover full accuracy on fine-tuning tasks at up to 70% sparsity, with faster training on Cerebras CS-3 chips and faster inference on CPUs and GPUs.

Contribution

The authors combine SparseGPT one-shot pruning with sparse pretraining to produce LLaMA-2 7B models that reach up to 70% sparsity with full accuracy recovery on fine-tuning tasks, while accelerating both training and inference.

Findings

01. Achieved up to 70% sparsity with full accuracy recovery.

02. Realized up to 8.6x total speedup on CPUs with sparse-quantized models.

03. Demonstrated broad task performance including chat, coding, and reasoning.

Abstract

Large language models (LLMs) have revolutionized Natural Language Processing (NLP), but their size creates computational bottlenecks. We introduce a novel approach to create accurate, sparse foundational versions of performant LLMs that achieve full accuracy recovery for fine-tuning tasks at up to 70% sparsity. We achieve this for the LLaMA-2 7B model by combining the SparseGPT one-shot pruning method and sparse pretraining of those models on a subset of the SlimPajama dataset mixed with a Python subset of The Stack dataset. We exhibit training acceleration due to sparsity on Cerebras CS-3 chips that closely matches theoretical scaling. In addition, we establish inference acceleration of up to 3x on CPUs by utilizing Neural Magic's DeepSparse engine and 1.7x on GPUs through Neural Magic's nm-vllm engine. The above gains are realized via sparsity alone, thus enabling further gains…
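The prune-then-train recipe described in the abstract can be illustrated with a toy sketch. This is not the authors' pipeline: simple magnitude pruning stands in for SparseGPT, a random gradient stands in for a real training loss, and the function names (`magnitude_mask`, `masked_sgd_step`, `ideal_speedup`) are hypothetical. The sketch only shows the general idea of fixing a sparsity mask once and re-applying it after every update so that pruned weights stay at zero during continued training. The `ideal_speedup` helper assumes an idealized model in which cost scales with the number of nonzero weights.

```python
import numpy as np


def magnitude_mask(weights, sparsity):
    """Return a 0/1 mask that drops the `sparsity` fraction of
    smallest-magnitude weights (a stand-in for SparseGPT, not SparseGPT)."""
    k = int(round(sparsity * weights.size))
    mask = np.ones(weights.size, dtype=weights.dtype)
    if k > 0:
        flat = np.abs(weights).ravel()
        # Indices of the k smallest magnitudes (order within them is irrelevant).
        drop = np.argpartition(flat, k - 1)[:k]
        mask[drop] = 0
    return mask.reshape(weights.shape)


def masked_sgd_step(weights, grad, mask, lr=0.1):
    """One SGD step that re-applies the mask so pruned weights stay zero."""
    return (weights - lr * grad) * mask


def ideal_speedup(sparsity):
    """Idealized speedup, assuming cost is proportional to nonzero weights."""
    return 1.0 / (1.0 - sparsity)


rng = np.random.default_rng(0)
W = rng.normal(size=(10, 10))

# One-shot pruning to 70% sparsity, then a few "training" steps.
mask = magnitude_mask(W, 0.7)
W = W * mask
for _ in range(3):
    grad = rng.normal(size=W.shape)  # placeholder for a real loss gradient
    W = masked_sgd_step(W, grad, mask)

achieved = 1.0 - np.count_nonzero(W) / W.size
print(f"sparsity after training: {achieved:.2f}")
print(f"idealized speedup at 70% sparsity: {ideal_speedup(0.7):.2f}x")
```

Under that idealized cost assumption, 70% sparsity corresponds to roughly a 3.3x reduction in work; measured speedups depend on the hardware and runtime.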

Taxonomy

Topics: Dynamics and Control of Mechanical Systems · Robotic Mechanisms and Dynamics · Computational Geometry and Mesh Generation

Methods: LLaMA · Pruning