$\mu$-Parametrization for Mixture of Experts

Jan Małaśnicki, Kamil Ciebiera, Mateusz Boruń, Maciej Pióro, Jan Ludziejewski, Maciej Stefaniak, Michał Krutul, Sebastian Jaszczur, Marek Cygan, Kamil Adamczewski, and Jakub Krajewski

arXiv:2508.09752·cs.LG·October 10, 2025

TL;DR

This paper introduces a $\mu$-Parameterization for Mixture-of-Experts (MoE) models that enables reliable transfer of hyperparameters such as the learning rate across model sizes, reducing tuning costs for large-scale MoE architectures.

Contribution

The paper derives a $\mu$-Parameterization for MoE with theoretical guarantees for feature learning across model widths, and shows experimentally that the optimal learning rate transfers across model sizes.

Findings

01

Optimal learning rate transfers reliably across model sizes.

02

The $\mu$-Parameterization improves hyperparameter tuning efficiency.

03

Experimental results confirm theoretical guarantees for MoE models.

Abstract

Recent years have seen a growing interest and adoption of LLMs, with Mixture-of-Experts (MoE) emerging as a leading architecture in extremely large models. Currently, the largest open-source models reach over 1T parameters. At such scales, hyperparameter tuning becomes prohibitively expensive. Precisely for this reason, the $\mu$Transfer is becoming a key technique. It allows for seamless transfer of optimal hyperparameters across model scales, resulting in a huge reduction in tuning costs. However, existing work has primarily focused on dense LLMs, leaving MoE architectures unexplored. In this work, we derive a $\mu$-Parameterization for MoE, providing theoretical guarantees for feature learning across model widths. Our experiments demonstrate that the optimal learning rate reliably transfers across model sizes, establishing a foundation for efficient hyperparameter tuning in…
