The Journey Matters: Average Parameter Count over Pre-training Unifies Sparse and Dense Scaling Laws
Tian Jin, Ahmed Imtiaz Humayun, Utku Evci, Suvinay Subramanian, Amir Yazdanbakhsh, Dan Alistarh, Gintare Karolina Dziugaite

TL;DR
This paper investigates optimal sparse pre-training configurations for large language models, introduces a unifying scaling law based on average parameter count, and demonstrates that sparse pre-training can match dense models' performance with reduced inference costs.
Contribution
It systematically explores sparse pre-training schedules, proposes a modified scaling law using average parameter count, and empirically validates its effectiveness across sparse and dense models.
Findings
Near-optimal pruning schedule: start pruning at 25% and finish at 75% of total training compute
Modified scaling law accurately models evaluation loss
Sparse pre-training matches dense performance with fewer parameters
Abstract
Pruning eliminates unnecessary parameters in neural networks; it offers a promising solution to the growing computational demands of large language models (LLMs). While many focus on post-training pruning, sparse pre-training--which combines pruning and pre-training into a single phase--provides a simpler alternative. In this work, we present the first systematic exploration of optimal sparse pre-training configurations for LLMs through an examination of 80 unique pruning schedules across different sparsity levels and training durations. We find that initiating pruning at 25% of total training compute and concluding at 75% achieves near-optimal final evaluation loss. These findings provide valuable insights for efficient and effective sparse pre-training of LLMs. Furthermore, we propose a new scaling law that modifies the Chinchilla scaling law to use the average parameter count over…
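The following is a minimal sketch of the idea behind the modified scaling law, not the authors' implementation. It assumes the standard Chinchilla form L(N, D) = E + A / N^alpha + B / D^beta and replaces N with the parameter count averaged over pre-training. The linear pruning ramp, the use of training-step fraction in place of compute fraction, the function names, and the coefficient values in the test are all illustrative assumptions; the paper's fitted coefficients and exact schedule shape are not given in this excerpt.

```python
def param_count_at(frac, n_dense, final_sparsity, start=0.25, end=0.75):
    """Parameter count at training fraction `frac` in [0, 1].

    Assumption: sparsity ramps linearly from 0 to `final_sparsity`
    between `start` and `end`; the paper's schedule may differ.
    """
    if frac <= start:
        return float(n_dense)
    if frac >= end:
        return n_dense * (1.0 - final_sparsity)
    progress = (frac - start) / (end - start)
    return n_dense * (1.0 - final_sparsity * progress)


def average_param_count(n_dense, final_sparsity, start=0.25, end=0.75, steps=10_000):
    """Average parameter count over training, via the midpoint rule."""
    total = 0.0
    for i in range(steps):
        frac = (i + 0.5) / steps
        total += param_count_at(frac, n_dense, final_sparsity, start, end)
    return total / steps


def chinchilla_loss(n_params, n_tokens, E, A, B, alpha, beta):
    """Chinchilla-form loss; pass the average parameter count as `n_params`
    to get the modified law sketched here."""
    return E + A / n_params**alpha + B / n_tokens**beta
```

For a 1B-parameter model pruned to 80% sparsity between 25% and 75% of training, this sketch gives an average of about 0.6B parameters: the model is dense for the first quarter, averages 0.6 of its dense size over the ramp, and sits at 0.2 of its dense size for the last quarter. Under the modified law, the sparse run's predicted loss is then that of a dense model with roughly 0.6B parameters trained on the same tokens.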
Code & Models
- 🤗 tjingrant/sparsellm-1b-40p (model)
- 🤗 tjingrant/sparsellm-1b-60p-small-dense (model)
- 🤗 tjingrant/sparsellm-1b-80p (model, 1 dl)
- 🤗 tjingrant/sparsellm-1b-60p (model)
- 🤗 tjingrant/sparsellm-1b-20p (model)
- 🤗 tjingrant/sparsellm-1b-80p-small-dense (model)
- 🤗 tjingrant/sparsellm-1b-40p-small-dense (model)
- 🤗 tjingrant/sparsellm-1b-20p-small-dense (model, 12 dl)
Taxonomy
Topics: Machine Learning and Data Classification · Neural Networks and Applications · Data Mining Algorithms and Applications
Methods: Chinchilla · Pruning · Focus
