Sparse maximal update parameterization: A holistic approach to sparse training dynamics

Nolan Dey; Shane Bergsma; Joel Hestness

arXiv:2405.15743 · cs.LG · February 4, 2026


TL;DR

This paper introduces SμPar, a holistic parameterization approach that stabilizes training dynamics and reduces hyperparameter tuning costs for sparse neural networks, enabling them to outperform dense models at high sparsity levels.

Contribution

SμPar reparameterizes hyperparameters so that activations, gradients, and weight updates all scale independently of the sparsity level. This lets hyperparameters transfer across sparsity levels and improves the performance of sparse neural networks.
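To make the idea of sparsity-independent activation scale concrete, here is a minimal sketch in plain Python. It is not the paper's implementation. It assumes a single linear layer with a random unstructured static mask, and it uses a 1/density correction to the initialization variance as an illustrative choice that follows from a standard variance argument; the paper's exact rules may differ. The names `output_rms`, `density`, and `init_std` are hypothetical.

```python
import math
import random


def output_rms(fan_in, density, init_std, n_out=200, seed=0):
    """RMS of y = (W * mask) @ x for an input x normalized to unit RMS.

    Each weight is kept independently with probability `density`
    (random unstructured static sparsity); kept weights are drawn
    from N(0, init_std**2).
    """
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(fan_in)]
    rms_x = math.sqrt(sum(v * v for v in x) / fan_in)
    x = [v / rms_x for v in x]  # normalize so that x has unit RMS

    total = 0.0
    for _ in range(n_out):
        y = 0.0
        for xi in x:
            if rng.random() < density:  # this weight survives the mask
                y += rng.gauss(0.0, init_std) * xi
        total += y * y
    return math.sqrt(total / n_out)


if __name__ == "__main__":
    fan_in = 512
    for density in (1.0, 0.25, 0.0625):
        # Dense-style init: std = 1/sqrt(fan_in), ignoring the mask.
        dense_style = output_rms(fan_in, density, 1.0 / math.sqrt(fan_in))
        # Illustrative correction (an assumption): std = 1/sqrt(fan_in * density).
        corrected = output_rms(fan_in, density, 1.0 / math.sqrt(fan_in * density))
        print(f"density={density:<7} dense-style RMS={dense_style:.3f} "
              f"corrected RMS={corrected:.3f}")
```

With the dense-style initialization, the output RMS shrinks roughly as the square root of the density, whereas the corrected initialization keeps it close to 1 at every density. The same kind of argument would have to be repeated for gradients and weight updates, which this sketch does not cover.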

Findings

1. Up to 11.9% relative loss improvement at 99.2% sparsity.
2. Hyperparameters tuned on small dense models transfer effectively to large sparse models.
3. SμPar enhances training stability and reduces tuning costs for sparse networks.

Abstract

Several challenges make it difficult for sparse neural networks to compete with dense models. First, setting a large fraction of weights to zero impairs forward and gradient signal propagation. Second, sparse studies often need to test multiple sparsity levels, while also introducing new hyperparameters (HPs), leading to prohibitive tuning costs. Indeed, the standard practice is to re-use the learning HPs originally crafted for dense models. Unfortunately, we show sparse and dense networks do not share the same optimal HPs. Without stable dynamics and effective training recipes, it is costly to test sparsity at scale, which is key to surpassing dense networks and making the business case for sparsity acceleration in hardware. A holistic approach is needed to tackle these challenges and we propose SμPar as one such approach. For random unstructured static sparsity, SμPar…


Code & Models

Repositories

eleutherai/nanogpt-mup (PyTorch, official)

Videos

Sparse maximal update parameterization: A holistic approach to sparse training dynamics · SlidesLive

Taxonomy

Topics: Model Reduction and Neural Networks · Speech and Audio Processing · Advanced Vision and Imaging