To 2:4 Sparsity and Beyond: Neuron-level Activation Function to Accelerate LLM Pre-Training
Meghana Madhyastha, Daniel Haziza, Jesse Cai, Newsha Ardalani, Zhiqi Bu, Carole-Jean Wu

TL;DR
This paper introduces a neuron-level activation function that leverages hardware-accelerated sparsity to speed up large language model pretraining by 1.4x to 1.7x end to end, with no loss on the authors' quality benchmarks.
Contribution
It proposes a sparsity-based method for accelerating LLM pretraining that runs on NVIDIA GPUs from the A100 generation onward and is compatible with existing optimization techniques.
Findings
Models trained with this method match baseline performance on the authors' quality benchmarks.
End-to-end training is 1.4x to 1.7x faster.
Applicable to all NVIDIA GPUs starting with the A100 generation, and to mixture-of-experts architectures.
Abstract
Training of large language models is generally bottlenecked by matrix multiplications. In the Transformer architecture, a large portion of these operations happen in the Feed Forward Network (FFN), and this portion increases for larger models, up to 50% of the total pretraining floating point operations. We show that we can leverage hardware-accelerated sparsity to accelerate all matrix multiplications in the FFN, with 2:4 sparsity for weights and v:n:m (Venom) sparsity for activations. Our recipe relies on sparse training steps to accelerate a large part of the pretraining, followed by regular dense training steps towards the end. Overall, models trained with this approach exhibit the same performance on our quality benchmarks, and training is sped up end-to-end by 1.4 to 1.7x. This approach is applicable to all NVIDIA GPUs starting with the A100 generation, and is orthogonal…
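The 2:4 pattern mentioned in the abstract means that every group of four consecutive values contains at most two nonzeros. The sketch below illustrates that constraint in plain Python. It is not the authors' implementation: the function names `prune_2_4` and `is_2_4_sparse` are hypothetical, and choosing which values to keep by magnitude is an assumption made for illustration. The v:n:m (Venom) format used for activations is not covered here.

```python
def prune_2_4(row):
    """Keep the two largest-magnitude values in every group of four; zero the rest.

    Magnitude-based selection is an illustrative assumption, not the paper's rule.
    """
    if len(row) % 4 != 0:
        raise ValueError("row length must be a multiple of 4")
    out = []
    for start in range(0, len(row), 4):
        group = row[start:start + 4]
        # Indices of the two entries with the largest absolute value.
        keep = sorted(range(4), key=lambda i: abs(group[i]), reverse=True)[:2]
        out.extend(v if i in keep else 0.0 for i, v in enumerate(group))
    return out


def is_2_4_sparse(row):
    """Check the 2:4 constraint: at most two nonzeros per group of four."""
    return all(
        sum(1 for v in row[s:s + 4] if v != 0) <= 2
        for s in range(0, len(row), 4)
    )


weights = [0.1, -0.9, 0.4, 0.05, 2.0, -0.3, 0.0, 1.5]
pruned = prune_2_4(weights)
print(pruned)
```

Because half of each group is guaranteed to be zero, hardware with sparse tensor cores can skip those positions during the matrix multiplication, which is where the speedup described in the abstract comes from.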
Taxonomy
Topics: Advanced Neural Network Applications · Parallel Computing and Optimization Techniques · Generative Adversarial Networks and Image Synthesis
