Generalizing Scaling Laws for Dense and Sparse Large Language Models
Md Arafat Hossain, Xingfu Wu, Valerie Taylor, Ali Jannesari

TL;DR
This paper introduces a unified scaling law applicable to both dense and sparse large language models, improving predictions of model size and resource allocation for various architectures.
Contribution
It proposes a generalized empirical scaling law that captures the behavior of both dense and sparse LLMs, unifying existing models into a single framework.
Findings
The proposed scaling law accurately models existing dense and sparse LLMs.
It outperforms previous laws in predicting model behavior across architectures.
It demonstrates effectiveness for Mixture-of-Experts LLMs such as DeepSeek-V3.
Abstract
Despite recent advances in large language models (LLMs), optimally predicting the model size for LLM pretraining or allocating optimal resources remains a challenge. Several efforts have addressed the challenge by proposing different empirical scaling laws, but almost all of them are architecture-specific (dense or sparse). In this work we revisit existing empirical scaling laws and propose a generalized scaling law to provide a unified framework that is applicable to both dense and sparse large language models. We evaluate and compare our proposed scaling law with existing scaling laws and demonstrate that our proposed scaling law captures the scaling behavior of existing scaling laws. Further, we show an IsoFLOP comparison between our proposed scaling law and the state-of-the-art scaling law to illustrate the effectiveness of our proposed scaling law for Mixture-of-Expert…
