Hyperparameter Transfer with Mixture-of-Expert Layers
Tianze Jiang, Blake Bordelon, Cengiz Pehlevan, Boris Hanin

TL;DR
This paper introduces a new parameterization for transformer models with Mixture-of-Experts (MoE) layers that enables reliable hyperparameter transfer as model width, depth, number of experts, and expert (hidden) size are scaled.
Contribution
The paper proposes a parameterization for MoE transformers, justified by a dynamical mean-field theory (DMFT) analysis, that supports hyperparameter transfer across model scales.
Findings
Hyperparameter transfer is reliable across models from 51M to 2B parameters.
The new parameterization enables effective scaling of MoE models.
Hyperparameters from small models can be used to train larger models efficiently.
Abstract
Mixture-of-Experts (MoE) layers have emerged as an important tool in scaling up modern neural networks by decoupling total trainable parameters from activated parameters in the forward pass for each token. However, sparse MoEs add complexity to training due to (i) new trainable parameters (router weights) that, like all other parameter groups, require hyperparameter (HP) tuning; (ii) new architecture scale dimensions (number of and size of experts) that must be chosen and potentially taken large. To make HP selection cheap and reliable, we propose a new parameterization for transformer models with MoE layers when scaling model width, depth, number of experts, and expert (hidden) size. Our parameterization is justified by a novel dynamical mean-field theory (DMFT) analysis. When varying different model dimensions trained at a fixed token budget, we find empirically that our…
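To make the distinction between total and activated parameters concrete, here is a minimal sketch that counts both for a single MoE feed-forward layer. It is not taken from the paper: the function name, the bias-free two-matrix expert, the linear router, and top-k routing are all assumptions chosen for illustration.

```python
def moe_ffn_param_counts(d_model, d_expert, n_experts, top_k):
    """Return (total, active_per_token) parameter counts for one MoE layer.

    Assumptions (illustrative, not from the paper): each expert is a
    bias-free two-matrix MLP (d_model -> d_expert -> d_model), the router
    is a single bias-free linear map, and each token is sent to top_k experts.
    """
    router = d_model * n_experts           # router weights
    per_expert = 2 * d_model * d_expert    # up- and down-projection
    total = router + n_experts * per_expert
    active = router + top_k * per_expert   # only top_k experts run per token
    return total, active


total, active = moe_ffn_param_counts(d_model=512, d_expert=2048, n_experts=8, top_k=2)
print(f"total={total:,} active={active:,}")
```

Under these assumptions, adding experts increases the total count roughly linearly, while the per-token active count changes only through the router term. That gap is what makes the number of experts and the expert size separate scaling dimensions.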
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Advanced Neural Network Applications · Advanced Graph Neural Networks
