$\mu$-Parametrization for Mixture of Experts
Jan Małaśnicki, Kamil Ciebiera, Mateusz Boruń, Maciej Pióro, Jan Ludziejewski, Maciej Stefaniak, Michał Krutul, Sebastian Jaszczur, Marek Cygan, Kamil Adamczewski, and Jakub Krajewski

TL;DR
This paper introduces a $\mu$-Parameterization for Mixture-of-Experts (MoE) models that enables reliable transfer of hyperparameters such as the learning rate across model sizes, reducing tuning costs for large-scale MoE architectures.
Contribution
It derives a $\mu$-Parameterization for MoE, with theoretical guarantees for feature learning across model widths and experiments showing hyperparameter transfer across model scales.
Findings
Optimal learning rate transfers reliably across model sizes.
The $\mu$-Parameterization improves hyperparameter tuning efficiency.
Experimental results confirm theoretical guarantees for MoE models.
Abstract
Recent years have seen growing interest in and adoption of LLMs, with Mixture-of-Experts (MoE) emerging as a leading architecture in extremely large models. Currently, the largest open-source models reach over a trillion parameters. At such scales, hyperparameter tuning becomes prohibitively expensive. Precisely for this reason, $\mu$Transfer is becoming a key technique. It allows for seamless transfer of optimal hyperparameters across model scales, resulting in a huge reduction in tuning costs. However, existing work has primarily focused on dense LLMs, leaving MoE architectures unexplored. In this work, we derive a $\mu$-Parameterization for MoE, providing theoretical guarantees for feature learning across model widths. Our experiments demonstrate that the optimal learning rate reliably transfers across model sizes, establishing a foundation for efficient hyperparameter tuning in…
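To make the idea of hyperparameter transfer concrete, here is a minimal sketch of how a learning rate tuned on a small proxy model could be rescaled for a wider one. It assumes the standard $\mu$P rule for hidden weight matrices of dense networks trained with Adam (learning rate proportional to 1/width, initialization standard deviation proportional to 1/sqrt(width)). It is not the MoE-specific parameterization derived in this paper, whose rules for the router and experts are not given in the abstract. The function names and the example widths are hypothetical.

```python
import math


def mup_hidden_adam_lr(base_lr: float, base_width: int, width: int) -> float:
    """Rescale a hidden-layer Adam learning rate tuned at base_width.

    Assumption: standard dense muP, where the hidden-layer Adam learning
    rate scales as 1/width. This is NOT the paper's MoE-specific rule.
    """
    return base_lr * base_width / width


def mup_hidden_init_std(base_std: float, base_width: int, width: int) -> float:
    """Rescale a hidden-layer init std tuned at base_width.

    Assumption: standard dense muP, where the init std scales as
    1/sqrt(width), i.e. variance proportional to 1/fan_in.
    """
    return base_std * math.sqrt(base_width / width)


# Hypothetical example: hyperparameters tuned on a width-256 proxy model,
# then transferred to a width-1024 target model.
target_lr = mup_hidden_adam_lr(base_lr=1e-2, base_width=256, width=1024)
target_std = mup_hidden_init_std(base_std=0.02, base_width=256, width=1024)
```

Under these assumptions, quadrupling the width divides the hidden-layer learning rate by four and halves the initialization standard deviation, so the sweep only needs to be run once on the small proxy.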
