Sparse maximal update parameterization: A holistic approach to sparse training dynamics
Nolan Dey, Shane Bergsma, Joel Hestness

TL;DR
This paper introduces SμPar, a holistic parameterization that stabilizes training dynamics and reduces hyperparameter tuning costs for sparse neural networks, with gains over the standard parameterization that grow as sparsity increases.
Contribution
SμPar reparameterizes hyperparameters so that activations, gradients, and weight updates all scale independently of sparsity level, which lets the same hyperparameter values transfer across sparsity levels and model widths.
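As a rough illustration of what "scaling independently of sparsity" means for forward activations, the sketch below masks a random linear layer with a static unstructured mask and compares two initializations. It is not the paper's implementation: the function name `sparse_layer_output_variance`, its parameters, and the specific rule of dividing the initialization variance by `fan_in * density` are assumptions chosen for this example, based only on the fact that a random mask with density ρ leaves roughly `fan_in * ρ` nonzero weights per output unit.

```python
import random
import statistics


def sparse_layer_output_variance(fan_in, density, scale_by_density,
                                 n_out=400, seed=0):
    """Empirical variance of y = (W * M) @ x under a random static mask M.

    Hypothetical helper for illustration only. Each weight is kept with
    probability `density` and drawn from a zero-mean Gaussian.
    """
    rng = random.Random(seed)
    # Input activations with unit variance.
    x = [rng.gauss(0.0, 1.0) for _ in range(fan_in)]
    # After masking, each output sums about fan_in * density terms.
    # Assumption: correct the init variance for that effective fan-in.
    effective_fan_in = fan_in * density if scale_by_density else fan_in
    std = (1.0 / effective_fan_in) ** 0.5
    outputs = []
    for _ in range(n_out):
        y = 0.0
        for xj in x:
            if rng.random() < density:  # weight survives the mask
                y += rng.gauss(0.0, std) * xj
        outputs.append(y)
    return statistics.pvariance(outputs)


if __name__ == "__main__":
    for density in (1.0, 0.25, 0.125):
        naive = sparse_layer_output_variance(512, density, False)
        corrected = sparse_layer_output_variance(512, density, True)
        print(f"density={density}: dense-style init var={naive:.3f}, "
              f"density-aware init var={corrected:.3f}")
```

With the dense-style initialization the output variance shrinks roughly in proportion to the density, while the density-aware variant stays near the dense value. The same reasoning would have to be repeated for gradients and weight updates, which this sketch does not cover.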
Findings
Up to 11.9% relative loss improvement over the standard parameterization at 99.2% sparsity.
Hyperparameters tuned on small dense models transfer effectively to large sparse models.
SμPar enhances training stability and reduces tuning costs for sparse networks.
Abstract
Several challenges make it difficult for sparse neural networks to compete with dense models. First, setting a large fraction of weights to zero impairs forward and gradient signal propagation. Second, sparse studies often need to test multiple sparsity levels, while also introducing new hyperparameters (HPs), leading to prohibitive tuning costs. Indeed, the standard practice is to re-use the learning HPs originally crafted for dense models. Unfortunately, we show sparse and dense networks do not share the same optimal HPs. Without stable dynamics and effective training recipes, it is costly to test sparsity at scale, which is key to surpassing dense networks and making the business case for sparsity acceleration in hardware. A holistic approach is needed to tackle these challenges, and we propose SμPar as one such approach. For random unstructured static sparsity, SμPar…
Taxonomy
Topics: Model Reduction and Neural Networks · Speech and Audio Processing · Advanced Vision and Imaging
