Scaling Laws for Sparsely-Connected Foundation Models
Elias Frantar, Carlos Riquelme, Neil Houlsby, Dan Alistarh, Utku Evci

TL;DR
This paper establishes a new scaling law for sparsely-connected foundation models, revealing how sparsity affects performance and optimal configurations across vision and language transformers trained on large datasets.
Contribution
It introduces the first empirical scaling law linking weight sparsity, model size, and training data, with insights into optimal sparsity levels and structures for foundation models.
Findings
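For a fixed number of non-zero parameters, the optimal sparsity increases with the amount of training data.
The proposed law is validated empirically across model and data scales, on ViT/JFT-4B (vision) and T5/C4 (language).
The study extends to other sparsity structures (such as the hardware-friendly n:m pattern) and strategies (such as starting from a pretrained dense model).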
Abstract
We explore the impact of parameter sparsity on the scaling behavior of Transformers trained on massive datasets (i.e., "foundation models"), in both vision and language domains. In this setting, we identify the first scaling law describing the relationship between weight sparsity, number of non-zero parameters, and amount of training data, which we validate empirically across model and data scales, on ViT/JFT-4B and T5/C4. These results allow us to characterize the "optimal sparsity": the sparsity level that yields the best performance for a given effective model size and training budget. For a fixed number of non-zero parameters, we identify that the optimal sparsity increases with the amount of data used for training. We also extend our study to different sparsity structures (such as the hardware-friendly n:m pattern) and strategies (such as starting from a pretrained dense model).…
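The n:m pattern mentioned in the abstract can be illustrated with a small sketch. It assumes the common convention that n of every m consecutive weights are kept non-zero (conventions vary, and the abstract does not spell one out), and it uses simple magnitude-based selection, which is an illustrative choice and not necessarily the pruning method used in the paper. The function names `nm_mask` and `sparsity` are hypothetical.

```python
def nm_mask(weights, n, m):
    """Return a 0/1 mask that keeps the n largest-magnitude weights
    in every block of m consecutive weights (assumed n:m convention)."""
    if not 0 <= n <= m:
        raise ValueError("n must satisfy 0 <= n <= m")
    if len(weights) % m != 0:
        raise ValueError("number of weights must be a multiple of m")
    mask = []
    for start in range(0, len(weights), m):
        block = weights[start:start + m]
        # Rank positions by descending magnitude; sorted() is stable,
        # so ties are broken in favour of the earlier position.
        ranked = sorted(range(m), key=lambda i: -abs(block[i]))
        keep = set(ranked[:n])
        mask.extend(1 if i in keep else 0 for i in range(m))
    return mask


def sparsity(mask):
    """Fraction of weights that are zeroed out by the mask."""
    return 1.0 - sum(mask) / len(mask)


# Toy example: a 2:4 pattern over eight weights.
weights = [0.9, -0.1, 0.05, -0.7, 0.2, 0.3, -0.8, 0.01]
mask = nm_mask(weights, n=2, m=4)
non_zero = sum(mask)  # number of non-zero parameters after masking
print(mask, sparsity(mask), non_zero)
```

Under this convention, a 2:4 mask always gives 50% sparsity, so a layer with N non-zero parameters has 2N weights in total. This is the kind of accounting that separates "number of non-zero parameters" from total weight count.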
Peer Reviews
Decision: ICLR 2024 spotlight
1. This paper investigates a topic highly relevant to current trends in large-scale models, namely the scaling laws of sparse large models. The authors present the formulation of scaling laws along with their derivation process and validate them on T5 and ViT models.
2. The authors also validate several applications related to scaling laws, providing reasonable experimental designs and detailed experimental results.
3. The writing throughout the article is quite commendable.
1. This paper posits that sparsity can reduce training costs, thus presenting research on optimal sparsity. However, in reality, the training costs of current sparse training methods are often the same as those without sparsity. Under these circumstances, is it meaningful to study optimal sparsity?
This study is the first paper to use large-scale experiments and theoretical analysis to explore in detail the effects of sparsity on neural network training and efficiency. The authors propose new scaling laws describing the relationship between sparsity, the number of non-zero parameters, and the amount of training data, which is important. On the one hand, the findings of this paper can help us configure the training parameters more scientifically and thus improve the efficiency and performan
Overall, I think this is a good paper; my only concern lies in the scale of the data and models. Compared to Chinchilla's law, the authors' data and model sizes are significantly smaller. Also, since the authors present the optimal sparsity setting as an empirical outcome of this law, could you elaborate on what specific improvements in downstream tasks or operational efficiency sparsity would bring to an actual model under the authors' theoretical guarantee?
1. The experimental results are comprehensive. I appreciate that the authors include experiments on hardware-friendly structured sparsity.
2. Deriving scaling laws for sparse foundation models is important. As foundation models require huge compute to train, understanding their scaling behavior and relationship to sparsity is crucial for predicting model performance.
3. The empirical results show that ViT and T5 show strong scaling curves for sparsity and their performances can be well predi
1. My main concern lies in the contribution of section 2 on fair evaluation, which argued that a sparse network should be compared with a dense model with the same number of parameters and compute. A very similar argument is made in [1] (See “training budget” in section 3), which argued that the compute budget for comparing sparse and dense models should be the same. It would be good to clarify the differences from [1] in the paper.
2. The evaluation on the encoder-decoder architecture T5 seems li
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Multimodal Machine Learning Applications · Advanced Neural Network Applications
