Tensor Programs V: Tuning Large Neural Networks via Zero-Shot Hyperparameter Transfer
Greg Yang, Edward J. Hu, Igor Babuschkin, Szymon Sidor, Xiaodong Liu, David Farhi, Nick Ryder, Jakub Pachocki, Weizhu Chen, Jianfeng Gao

TL;DR
This paper introduces muTransfer, a zero-shot hyperparameter transfer method for large neural networks: hyperparameters are tuned on a small model and reused unchanged on a larger one, which significantly reduces tuning cost.
Contribution
The paper proposes muTransfer, a hyperparameter tuning paradigm that exploits the stability of many optimal hyperparameters across model sizes under the Maximal Update Parametrization (muP), enabling zero-shot transfer from small to large models.
Findings
Hyperparameters transferred from small models let large models outperform published baselines (BERT-large and the 6.7B GPT-3 model).
muTransfer reduces hyperparameter tuning costs by orders of magnitude.
muTransfer is verified on Transformer and ResNet architectures.
Abstract
Hyperparameter (HP) tuning in deep learning is an expensive process, prohibitively so for neural networks (NNs) with billions of parameters. We show that, in the recently discovered Maximal Update Parametrization (muP), many optimal HPs remain stable even as model size changes. This leads to a new HP tuning paradigm we call muTransfer: parametrize the target model in muP, tune the HP indirectly on a smaller model, and zero-shot transfer them to the full-sized model, i.e., without directly tuning the latter at all. We verify muTransfer on Transformer and ResNet. For example, 1) by transferring pretraining HPs from a model of 13M parameters, we outperform published numbers of BERT-large (350M parameters), with a total tuning cost equivalent to pretraining BERT-large once; 2) by transferring from 40M parameters, we outperform published numbers of the 6.7B GPT-3 model, with tuning cost only…
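The three steps named in the abstract (parametrize in muP, tune on a smaller model, transfer zero-shot) can be sketched as a workflow. The sketch below is only an illustration of that workflow, not the authors' implementation: `proxy_loss` is a hypothetical toy stand-in for "train a muP-parametrized model of a given width and report its validation loss", and the widths, the learning-rate grid, and the optimum built into the toy are made-up values. The toy simply assumes the property the paper reports, namely that the best learning rate does not move with width; it does not implement muP itself.

```python
import math


def proxy_loss(width: int, lr: float) -> float:
    """Hypothetical stand-in for training a muP model and returning its loss.

    Toy assumption: the best learning rate is the same at every width,
    and wider models reach a lower loss. Both constants are made up.
    """
    best_log_lr = math.log10(3e-3)  # made-up optimum, not a number from the paper
    return (math.log10(lr) - best_log_lr) ** 2 + 1.0 / width


def tune_on_small_model(width: int, candidate_lrs: list[float]) -> float:
    """Grid-search the learning rate on the small (cheap) proxy model."""
    return min(candidate_lrs, key=lambda lr: proxy_loss(width, lr))


# Hypothetical grid and model widths, chosen only for illustration.
candidate_lrs = [1e-4, 3e-4, 1e-3, 3e-3, 1e-2, 3e-2]
small_width, large_width = 256, 8192

# Steps 1-2: tune on the small model only.
transferred_lr = tune_on_small_model(small_width, candidate_lrs)

# Step 3: reuse the learning rate on the large model without further tuning.
large_loss = proxy_loss(large_width, transferred_lr)

print(f"lr tuned at width {small_width}: {transferred_lr}")
print(f"loss at width {large_width} with transferred lr: {large_loss:.6f}")
```

In the toy, the transferred learning rate is also the best grid point at the large width, which is exactly the behaviour that would have to hold for the real models. In practice the expensive part replaced by `proxy_loss` is a full training run, so the search cost is paid only at the small width.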
Videos
Tensor Programs V: Tuning Large Neural Networks via Zero-Shot Hyperparameter Transfer (μTransfer) · YouTube
Taxonomy
Topics: Parallel Computing and Optimization Techniques · Advanced Neural Network Applications · Computational Physics and Python Applications
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Cosine Annealing · Attention Dropout · Dropout · 1x1 Convolution · Linear Warmup With Cosine Annealing
