S-STE: Continuous Pruning Function for Efficient 2:4 Sparse Pre-training

Yuezhou Hu; Jun Zhu; Jianfei Chen

arXiv:2409.09099·cs.LG·December 30, 2024

TL;DR

This paper introduces S-STE, a continuous pruning method for 2:4 sparse pre-training of neural networks. It avoids the optimization problems caused by the discontinuous pruning functions of earlier STE-based methods and outperforms those methods.

Contribution

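S-STE combines two components for 2:4 sparse training: a continuous projection of the weights onto the 2:4 sparse pattern, and a rescaling of the sparse weights with a fixed per-tensor scaling factor. Together they address the discontinuity problems of earlier pruning functions and improve pre-training effectiveness.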

Findings

01 Outperforms previous 2:4 pre-training methods.

02 Achieves results comparable to dense (full-parameter) models.

03 Uses FP8 quantization to improve efficiency.

Abstract

Training deep neural networks (DNNs) is costly. Fortunately, Nvidia Ampere and Hopper GPUs can perform matrix multiplications twice as fast as their dense equivalents by implementing 2:4 sparsity. However, previous STE-based 2:4 pre-training methods (e.g. STE with hard-thresholding, SR-STE) suffer from optimization difficulties because their pruning function is discontinuous. In this study, we comprehensively analyse the bottleneck of traditional N:M sparse training and identify three drawbacks caused by discontinuity: an incorrect descent direction, an inability to predict the amount of descent, and sparse mask oscillation. In light of this, we propose S-STE, a simple yet powerful 2:4 training method that consists of two parts: continuously projecting weights to be 2:4 sparse, and rescaling the sparse weights with a per-tensor fixed scaling factor. Besides, we adopt minimum-variance unbiased estimation for…
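The two parts named in the abstract can be illustrated with a small sketch. The abstract excerpt does not give the formulas, so everything below is an assumption made for illustration: soft-thresholding each group of four weights by its second-smallest magnitude is used as one possible continuous 2:4 projection, and a least-squares fit is used as one possible per-tensor scaling factor. The function names `soft_threshold_group`, `project_2to4`, and `least_squares_scale` are hypothetical and are not taken from the paper's code.

```python
import math
from typing import List, Sequence


def soft_threshold_group(group: Sequence[float]) -> List[float]:
    """Project one group of four weights onto a 2:4 pattern (assumed form).

    Every magnitude is shrunk by the second-smallest magnitude in the group,
    so the two smallest weights become exactly zero and the output changes
    continuously when the input changes.
    """
    if len(group) != 4:
        raise ValueError("a 2:4 group must contain exactly four weights")
    threshold = sorted(abs(w) for w in group)[1]
    projected = []
    for w in group:
        shrunk = abs(w) - threshold
        # Keep the sign of the original weight; clamp non-positive values to 0.
        projected.append(math.copysign(shrunk, w) if shrunk > 0.0 else 0.0)
    return projected


def project_2to4(weights: Sequence[float]) -> List[float]:
    """Apply the group projection to a flat tensor, four weights at a time."""
    if len(weights) % 4 != 0:
        raise ValueError("tensor length must be a multiple of four")
    out: List[float] = []
    for start in range(0, len(weights), 4):
        out.extend(soft_threshold_group(weights[start:start + 4]))
    return out


def least_squares_scale(dense: Sequence[float], sparse: Sequence[float]) -> float:
    """Assumed choice of scale: the beta minimizing ||dense - beta * sparse||^2."""
    numerator = sum(d * s for d, s in zip(dense, sparse))
    denominator = sum(s * s for s in sparse)
    return numerator / denominator if denominator > 0.0 else 1.0


dense = [1.0, -0.25, 0.5, -0.75, 0.125, 2.0, -1.0, 0.5]
sparse = project_2to4(dense)
# "Fixed" is modeled here by computing beta once and reusing it afterwards.
beta = least_squares_scale(dense, sparse)
rescaled = [beta * s for s in sparse]
```

Under these assumptions, swapping the order of two nearly equal weights (for example 0.51 and 0.5) changes the projected group only slightly, whereas a hard-thresholding mask would drop a different weight entirely. Because soft-thresholding shrinks the surviving weights, a scale larger than one partially restores their magnitude.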

Taxonomy

Topics: Neural Networks and Applications

Methods: Pruning