STEP: Learning N:M Structured Sparsity Masks from Scratch with Precondition
Yucheng Lu, Shivani Agrawal, Suvinay Subramanian, Oleg Rybakov, Christopher De Sa, Amir Yazdanbakhsh

TL;DR
This paper introduces STEP, an Adam-aware method for learning N:M structured sparsity masks from scratch. By accounting for how the Adam optimizer estimates its second moment (variance), STEP improves the accuracy and robustness of sparse models across a range of tasks.
Contribution
STEP is the first adaptive recipe that incorporates variance estimation to learn N:M masks effectively with Adam, addressing the accuracy drop that prior recipes such as SR-STE incur on Adam-trained models.
Findings
STEP reduces the accuracy drop of N:M sparse models trained with Adam.
It is effective across classification, translation, and language modeling tasks.
STEP maintains robustness at high sparsity ratios.
Abstract
Recent innovations in hardware (e.g., Nvidia A100) have motivated learning N:M structured sparsity masks from scratch for fast model inference. However, state-of-the-art learning recipes in this regime (e.g., SR-STE) are designed for non-adaptive optimizers such as momentum SGD, and they incur a non-trivial accuracy drop on Adam-trained models such as attention-based LLMs. In this paper, we first demonstrate that this gap originates from a poorly estimated second moment (i.e., variance) in the Adam states computed from the masked weights. We conjecture that learning N:M masks with Adam should take the critical regime of variance estimation into account. In light of this, we propose STEP, an Adam-aware recipe that learns N:M masks in two phases: first, STEP calculates a reliable variance estimate (precondition phase); subsequently, the variance remains fixed and is used as a precondition to learn N:M masks…
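The two-phase idea described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the helper names `nm_mask` and `two_phase_adam`, the fixed `switch_step`, the magnitude-based mask selection, and the plain straight-through gradient in the second phase are all assumptions made for illustration. How the paper chooses the switch point is not covered by the text above.

```python
import numpy as np

def nm_mask(w, n=2, m=4):
    """Binary N:M mask: keep the n largest-magnitude entries in each
    consecutive group of m (assumes w.size is divisible by m)."""
    groups = np.abs(w).reshape(-1, m)
    keep = np.argsort(groups, axis=1)[:, -n:]   # indices of the n largest per group
    mask = np.zeros_like(groups)
    np.put_along_axis(mask, keep, 1.0, axis=1)
    return mask.reshape(w.shape)

def two_phase_adam(w, grad_fn, steps, switch_step, n=2, m=4,
                   lr=1e-2, b1=0.9, b2=0.999, eps=1e-8):
    """Hypothetical two-phase loop (requires switch_step >= 1).
    Phase 1: dense Adam, variance estimate is updated.
    Phase 2: variance is frozen; the gradient is taken at the masked
    weights and applied to the dense weights (straight-through)."""
    mom = np.zeros_like(w)
    var = np.zeros_like(w)
    var_hat = None
    for t in range(1, steps + 1):
        if t <= switch_step:
            g = grad_fn(w)                        # dense forward/backward
            var = b2 * var + (1 - b2) * g ** 2    # update second moment
            var_hat = var / (1 - b2 ** t)         # bias-corrected variance
        else:
            g = grad_fn(w * nm_mask(w, n, m))     # masked forward/backward
            # var_hat is left untouched: it now acts as a fixed preconditioner
        mom = b1 * mom + (1 - b1) * g
        mom_hat = mom / (1 - b1 ** t)
        w = w - lr * mom_hat / (np.sqrt(var_hat) + eps)
    return w * nm_mask(w, n, m)                   # final sparse weights

# Toy usage: quadratic loss 0.5 * ||w - target||^2, so grad = w - target.
rng = np.random.default_rng(0)
target = rng.normal(size=(4, 8))
w_sparse = two_phase_adam(np.zeros((4, 8)), lambda w: w - target,
                          steps=200, switch_step=50)
```

With the default 2:4 pattern, every consecutive group of four entries in `w_sparse` has at most two nonzero values. Freezing `var_hat` after `switch_step` means the sparse gradients seen in the second phase never contaminate the variance estimate, which is the failure mode the abstract attributes to running Adam directly on masked weights.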
Taxonomy
Topics: Neural Networks and Applications · Model Reduction and Neural Networks · Generative Adversarial Networks and Image Synthesis
Methods: Stochastic Gradient Descent · Adam
