STEP: Learning N:M Structured Sparsity Masks from Scratch with Precondition

Yucheng Lu; Shivani Agrawal; Suvinay Subramanian; Oleg Rybakov; Christopher De Sa; Amir Yazdanbakhsh

arXiv:2302.01172·cs.LG·February 3, 2023·1 cites

TL;DR

This paper introduces STEP, an Adam-aware method for learning N:M structured sparsity masks from scratch. By accounting for variance estimation in the Adam optimizer, it improves the accuracy and robustness of sparse models across a range of tasks.

Contribution

STEP is the first adaptive recipe that incorporates variance estimation to effectively learn N:M masks with Adam, addressing accuracy issues in prior methods.

Findings

01 STEP reduces the accuracy drop in sparse models.

02 It is effective across classification, translation, and language models.

03 STEP maintains robustness at high sparsity ratios.

Abstract

Recent hardware innovations (e.g., the Nvidia A100) have motivated learning N:M structured sparsity masks from scratch for fast model inference. However, state-of-the-art learning recipes in this regime (e.g., SR-STE) were proposed for non-adaptive optimizers such as momentum SGD, and they incur a non-trivial accuracy drop for Adam-trained models such as attention-based LLMs. In this paper, we first demonstrate that this gap originates from a poorly estimated second moment (i.e., variance) in the Adam states given by the masked weights. We conjecture that learning N:M masks with Adam should take the critical regime of variance estimation into account. In light of this, we propose STEP, an Adam-aware recipe that learns N:M masks in two phases: first, STEP calculates a reliable variance estimate (precondition phase); subsequently, the variance remains fixed and is used as a precondition to learn N:M masks…
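
To make the two ideas in the abstract concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes a magnitude-based N:M mask (keep the N largest-magnitude weights in every group of M consecutive weights), a straight-through gradient for the masked phase, and a hand-picked switching point. The names `nm_mask`, `two_phase_adam`, `grad_fn`, and `switch_step` are hypothetical and exist only for illustration; how the switch between phases should be chosen is not stated in the text above.

```python
import numpy as np


def nm_mask(w, n=2, m=4):
    """Boolean mask keeping the n largest-magnitude entries in each group
    of m consecutive weights. Assumes w.size is divisible by m."""
    groups = np.abs(w).reshape(-1, m)
    # Column indices of the n largest magnitudes in every group.
    keep = np.argsort(-groups, axis=1)[:, :n]
    mask = np.zeros(groups.shape, dtype=bool)
    np.put_along_axis(mask, keep, True, axis=1)
    return mask.reshape(w.shape)


def two_phase_adam(w, grad_fn, steps, switch_step,
                   lr=0.05, b1=0.9, b2=0.999, eps=1e-8):
    """Illustrative two-phase Adam loop (assumed structure, not the paper's code).

    Phase 1 (t <= switch_step): dense training, second moment updated.
    Phase 2 (t >  switch_step): second moment frozen and reused as a
    preconditioner while the N:M mask is recomputed every step.
    """
    w = w.astype(float)      # astype returns a copy, so the input is untouched
    mom = np.zeros_like(w)   # first moment
    var = np.zeros_like(w)   # second moment (variance estimate)
    for t in range(1, steps + 1):
        if t <= switch_step:
            g = grad_fn(w)                          # dense gradient
            var = b2 * var + (1 - b2) * g * g       # update variance
        else:
            mask = nm_mask(w)
            # Straight-through assumption: the gradient computed at the
            # masked weights is applied to the dense weights.
            g = grad_fn(w * mask)
        mom = b1 * mom + (1 - b1) * g
        m_hat = mom / (1 - b1 ** t)
        # Bias correction uses the number of variance updates actually made.
        v_hat = var / (1 - b2 ** min(t, switch_step))
        w -= lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, nm_mask(w), var
```

With a toy objective such as `grad_fn = lambda x: x - target`, the returned mask keeps exactly two entries in every group of four, and the returned variance is the same whether the loop stops at `switch_step` or continues past it, which is the "variance remains fixed" behavior the abstract describes.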

Code & Models

Repositories

huyz2023/2by4-pretrain
pytorch

Videos

STEP: Learning N:M Structured Sparsity Masks from Scratch with Precondition · SlidesLive

Taxonomy

Topics: Neural Networks and Applications · Model Reduction and Neural Networks · Generative Adversarial Networks and Image Synthesis

Methods: Stochastic Gradient Descent · Adam