Scaling Laws for Fine-Grained Mixture of Experts
Jakub Krajewski, Jan Ludziejewski, Kamil Adamczewski, Maciej Pióro, Michał Krutul, Szymon Antoniak, Kamil Ciebiera, Krystian Król, Tomasz Odrzygóźdź, Piotr Sankowski, Marek Cygan, Sebastian Jaszczur

TL;DR
This paper derives scaling laws for fine-grained Mixture of Experts models, revealing how to optimize their configuration for efficiency and outperform dense Transformers as models grow larger.
Contribution
It introduces a new hyperparameter, granularity, and establishes scaling laws for fine-grained MoE, guiding optimal training configurations and highlighting their advantages over dense models.
Findings
MoE models outperform dense Transformers across scales.
Efficiency gap widens with larger models and budgets.
Optimal expert size differs from the common practice of making each expert mirror the feed-forward layer.
Abstract
Mixture of Experts (MoE) models have emerged as a primary solution for reducing the computational cost of Large Language Models. In this work, we analyze their scaling properties, incorporating an expanded range of variables. Specifically, we introduce a new hyperparameter, granularity, whose adjustment enables precise control over the size of the experts. Building on this, we establish scaling laws for fine-grained MoE, taking into account the number of training tokens, model size, and granularity. Leveraging these laws, we derive the optimal training configuration for a given computational budget. Our findings not only show that MoE models consistently outperform dense Transformers but also highlight that the efficiency gap between dense and MoE models widens as we scale up the model size and training budget. Furthermore, we demonstrate that the common practice of setting the size of…
