Scaling Laws for Fine-Grained Mixture of Experts

Jakub Krajewski; Jan Ludziejewski; Kamil Adamczewski; Maciej Pióro; Michał Krutul; Szymon Antoniak; Kamil Ciebiera; Krystian Król; Tomasz Odrzygóźdź; Piotr Sankowski; Marek Cygan; Sebastian Jaszczur

arXiv:2402.07871 · cs.LG · February 13, 2024 · 3 cites

TL;DR

This paper derives scaling laws for fine-grained Mixture of Experts models, uses them to find the optimal training configuration for a given compute budget, and shows that the efficiency advantage over dense Transformers widens as model size and training budget grow.

Contribution

It introduces a new hyperparameter, granularity, and establishes scaling laws for fine-grained MoE, guiding optimal training configurations and highlighting their advantages over dense models.

Findings

1. MoE models consistently outperform dense Transformers across scales.
2. The efficiency gap between dense and MoE models widens as model size and training budget grow.
3. The optimal expert size differs from the common practice of matching it to the feed-forward layer size.

Abstract

Mixture of Experts (MoE) models have emerged as a primary solution for reducing the computational cost of Large Language Models. In this work, we analyze their scaling properties, incorporating an expanded range of variables. Specifically, we introduce a new hyperparameter, granularity, whose adjustment enables precise control over the size of the experts. Building on this, we establish scaling laws for fine-grained MoE, taking into account the number of training tokens, model size, and granularity. Leveraging these laws, we derive the optimal training configuration for a given computational budget. Our findings not only show that MoE models consistently outperform dense Transformers but also highlight that the efficiency gap between dense and MoE models widens as we scale up the model size and training budget. Furthermore, we demonstrate that the common practice of setting the size of…
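To make the role of granularity concrete, the sketch below shows one way such a hyperparameter could control expert size. It assumes that granularity G divides a reference feed-forward hidden width `d_ff`, so each expert has hidden width `d_ff / G`, and that both the number of experts and the number of experts activated per token are multiplied by G, which keeps total and active parameter counts fixed. These definitions, and the names `MoEConfig`, `with_granularity`, and `expansion`, are assumptions made for illustration; the abstract itself only states that granularity controls the size of the experts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MoEConfig:
    """Hypothetical description of one MoE feed-forward layer."""
    d_model: int    # width of the residual stream
    d_expert: int   # hidden width of a single expert
    n_experts: int  # number of experts in the layer
    top_k: int      # experts activated per token

    def total_ff_params(self) -> int:
        # Two weight matrices per expert (in and out projections), biases ignored.
        return 2 * self.d_model * self.d_expert * self.n_experts

    def active_ff_params(self) -> int:
        # Parameters actually used for one token.
        return 2 * self.d_model * self.d_expert * self.top_k


def with_granularity(d_model: int, d_ff: int, expansion: int, granularity: int) -> MoEConfig:
    """Build a layer config under the assumed definition of granularity.

    Assumption: each expert is `granularity` times narrower than d_ff, while
    the expert count and the per-token routing count both grow by the same factor.
    """
    if granularity < 1 or d_ff % granularity != 0:
        raise ValueError("granularity must be a positive divisor of d_ff")
    return MoEConfig(
        d_model=d_model,
        d_expert=d_ff // granularity,
        n_experts=expansion * granularity,
        top_k=granularity,
    )


if __name__ == "__main__":
    for g in (1, 2, 4, 8):
        cfg = with_granularity(d_model=512, d_ff=2048, expansion=8, granularity=g)
        print(g, cfg.d_expert, cfg.n_experts, cfg.top_k,
              cfg.total_ff_params(), cfg.active_ff_params())
```

Under these assumptions, raising the granularity yields more, smaller experts without changing the total or per-token parameter count, which is what lets expert size be varied independently of model size.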

Code & Models

Repositories

llm-random/llm-random (PyTorch, official)
