S-STE: Continuous Pruning Function for Efficient 2:4 Sparse Pre-training
Yuezhou Hu, Jun Zhu, Jianfei Chen

TL;DR
This paper introduces S-STE, a continuous pruning function for 2:4 sparse pre-training of neural networks. It avoids the optimization problems that the discontinuous pruning functions of earlier STE-based methods cause.
Contribution
S-STE provides a continuous projection and rescaling approach for 2:4 sparse training, addressing discontinuity problems and improving pre-training effectiveness.
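To make the discontinuity problem concrete, the following minimal sketch shows plain hard-thresholding 2:4 pruning (keep the two largest-magnitude entries in each group of four). The helper name `hard_prune_2of4` and the example values are hypothetical and are not taken from the paper; the point is only that a tiny change in the input can cause a large jump in the pruned output.

```python
def hard_prune_2of4(group):
    """Keep the two largest-magnitude entries of a 4-element group, zero the rest.

    Illustrative helper (hypothetical name), not the authors' code.
    """
    assert len(group) == 4
    # Indices of the two largest magnitudes; sorted() is stable, so ties keep
    # their original left-to-right order.
    keep = sorted(range(4), key=lambda i: abs(group[i]), reverse=True)[:2]
    return [group[i] if i in keep else 0.0 for i in range(4)]


eps = 1e-6
# Two inputs that differ only by a tiny swap of the 2nd and 3rd entries.
a = [1.0, 0.5 + eps, 0.5 - eps, 0.1]
b = [1.0, 0.5 - eps, 0.5 + eps, 0.1]

pa = hard_prune_2of4(a)
pb = hard_prune_2of4(b)

# Largest coordinate-wise difference before and after pruning.
input_gap = max(abs(x - y) for x, y in zip(a, b))
output_gap = max(abs(x - y) for x, y in zip(pa, pb))

print("pruned a:", pa)
print("pruned b:", pb)
print("input gap:", input_gap, "output gap:", output_gap)
```

The inputs differ by about `2 * eps` per coordinate, yet the pruned outputs differ by roughly 0.5 because the mask flips between the second and third positions.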
Findings
Outperforms previous 2:4 pre-training methods.
Achieves results comparable to dense (full-parameter) models.
Utilizes FP8 quantization for efficiency.
Abstract
Training deep neural networks (DNNs) is costly. Fortunately, Nvidia Ampere and Hopper GPUs can execute matrix multiplications twice as fast as their dense equivalents by implementing 2:4 sparsity. However, previous STE-based 2:4 pre-training methods (e.g. STE with hard-thresholding, SR-STE) suffer from optimization difficulties because their pruning function is discontinuous. In this study, we comprehensively analyse the bottleneck of traditional N:M sparse training and identify three drawbacks caused by discontinuity: an incorrect descent direction, an inability to predict the amount of descent, and sparse mask oscillation. In light of this, we propose S-STE, a simple yet powerful 2:4 training method that consists of two parts: continuously projecting weights to be 2:4 sparse, and rescaling the sparse weights with a per-tensor fixed scaling factor. Besides, we adopt minimum-variance unbiased estimation for…
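The two parts named in the abstract can be sketched roughly as follows. The abstract shown here does not spell out the projection or how the scaling factor is chosen, so both choices below are assumptions made for illustration: the projection is soft-thresholding with the second-smallest magnitude in each group of four as the threshold, and the per-tensor scale is a least-squares fit computed once and then held fixed. All function names are hypothetical.

```python
def soft_prune_2of4(group):
    """Soft-threshold a 4-element group so that at least two entries become zero.

    Assumption: the threshold is the second-smallest magnitude in the group.
    Every entry is shrunk toward zero by that amount, so the output changes
    continuously with the input.
    """
    assert len(group) == 4
    t = sorted(abs(x) for x in group)[1]  # second-smallest magnitude
    out = []
    for x in group:
        shrunk = max(abs(x) - t, 0.0)
        if shrunk == 0.0:
            out.append(0.0)
        else:
            out.append(shrunk if x > 0 else -shrunk)
    return out


def soft_prune_tensor(weights):
    """Apply the group-wise projection to a flat list whose length is a multiple of 4."""
    assert len(weights) % 4 == 0
    out = []
    for i in range(0, len(weights), 4):
        out.extend(soft_prune_2of4(weights[i:i + 4]))
    return out


def fit_scale(weights, sparse):
    """Per-tensor scale beta minimising sum((w - beta * s)^2).

    Assumption: the scale is fitted once and kept fixed afterwards.
    """
    num = sum(w * s for w, s in zip(weights, sparse))
    den = sum(s * s for s in sparse)
    return num / den if den > 0 else 1.0


w = [1.0, -0.6, 0.5, 0.1, -0.2, 0.3, 0.8, -0.9]
s = soft_prune_tensor(w)
beta = fit_scale(w, s)            # computed once, then frozen
w_sparse = [beta * x for x in s]  # rescaled 2:4 sparse weights

# Continuity check: a tiny input change gives a tiny output change.
eps = 1e-6
a = soft_prune_2of4([1.0, 0.5 + eps, 0.5 - eps, 0.1])
b = soft_prune_2of4([1.0, 0.5 - eps, 0.5 + eps, 0.1])
soft_gap = max(abs(x - y) for x, y in zip(a, b))

print("sparse:", s)
print("beta:", beta)
print("soft gap:", soft_gap)
```

Because soft-thresholding shrinks the surviving weights, the fitted scale comes out larger than 1 in this example, which is one way to see why a rescaling step accompanies the projection.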
Taxonomy
Topics: Neural Networks and Applications
Methods: Pruning
