Tensor Programs V: Tuning Large Neural Networks via Zero-Shot Hyperparameter Transfer

Greg Yang; Edward J. Hu; Igor Babuschkin; Szymon Sidor; Xiaodong Liu; David Farhi; Nick Ryder; Jakub Pachocki; Weizhu Chen; Jianfeng Gao

arXiv:2203.03466·cs.LG·March 29, 2022·22 cites

TL;DR

This paper introduces muTransfer, a zero-shot hyperparameter transfer method for large neural networks: hyperparameters are tuned on a small model and reused unchanged on a much larger one, which substantially reduces tuning cost.

Contribution

The paper proposes muTransfer, a hyperparameter tuning paradigm that relies on many optimal hyperparameters staying stable across model sizes under the Maximal Update Parametrization (muP), enabling zero-shot transfer from small to large models.

Findings

01

Models tuned by transferring hyperparameters from small proxies outperform published numbers for BERT-large (350M parameters) and the 6.7B GPT-3 model.

02

muTransfer reduces hyperparameter tuning costs by orders of magnitude.

03

Effective on Transformer and ResNet architectures.

Abstract

Hyperparameter (HP) tuning in deep learning is an expensive process, prohibitively so for neural networks (NNs) with billions of parameters. We show that, in the recently discovered Maximal Update Parametrization (muP), many optimal HPs remain stable even as model size changes. This leads to a new HP tuning paradigm we call muTransfer: parametrize the target model in muP, tune the HP indirectly on a smaller model, and zero-shot transfer them to the full-sized model, i.e., without directly tuning the latter at all. We verify muTransfer on Transformer and ResNet. For example, 1) by transferring pretraining HPs from a model of 13M parameters, we outperform published numbers of BERT-large (350M parameters), with a total tuning cost equivalent to pretraining BERT-large once; 2) by transferring from 40M parameters, we outperform published numbers of the 6.7B GPT-3 model, with tuning cost only…
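The three steps named in the abstract (parametrize in muP, tune on a small model, transfer to the full-sized model) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `TunedHPs` container, the `transfer_to_width` helper, and the example numbers are all hypothetical, and the width-scaling rule shown (hidden-layer Adam learning rate and output logits divided by the width ratio) is an assumed, simplified formulation of muP that the abstract itself does not spell out. The paper should be consulted for the exact per-layer rules.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TunedHPs:
    """Hyperparameters found by tuning a small proxy model (hypothetical container)."""
    base_width: int        # hidden width of the small proxy model
    learning_rate: float   # best Adam learning rate found on the proxy
    output_mult: float     # best output-logit multiplier found on the proxy


def transfer_to_width(hps: TunedHPs, target_width: int) -> dict:
    """Reuse proxy hyperparameters at a larger width with no tuning on the large model.

    Assumed scaling (a simplified muP rule for Adam, not taken from the abstract):
    hidden weight matrices use lr / width_mult, input and output layers keep the
    tuned lr, and output logits are divided by width_mult.
    """
    width_mult = target_width / hps.base_width
    return {
        "width_mult": width_mult,
        "lr_input": hps.learning_rate,
        "lr_hidden": hps.learning_rate / width_mult,
        "lr_output": hps.learning_rate,
        "output_scale": hps.output_mult / width_mult,
    }


# Step 1-2: pretend these values came out of a sweep on a width-256 proxy.
proxy = TunedHPs(base_width=256, learning_rate=1e-3, output_mult=4.0)

# Step 3: derive the settings for a width-4096 target without further tuning.
large = transfer_to_width(proxy, target_width=4096)
print(large)
```

At the proxy's own width the helper returns the tuned values unchanged, so the small and large models share one set of base hyperparameters and differ only through the width ratio.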

Code & Models

Videos

Tensor Programs V: Tuning Large Neural Networks via Zero-Shot Hyperparameter Transfer (μTransfer) · YouTube

Taxonomy

Topics: Parallel Computing and Optimization Techniques · Advanced Neural Network Applications · Computational Physics and Python Applications

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Cosine Annealing · Attention Dropout · Dropout · 1x1 Convolution · Linear Warmup With Cosine Annealing