On the adequacy of untuned warmup for adaptive optimization

Jerry Ma; Denis Yarats

arXiv:1910.04209 · cs.LG · March 23, 2021 · 22 cites

TL;DR

This paper investigates the role of warmup in adaptive optimizers like Adam, showing that simple untuned warmup performs comparably to more complex rectified methods such as RAdam, and provides practical guidelines for warmup schedules.

Contribution

The paper challenges the necessity of complex warmup schemes like RAdam, demonstrating that untuned linear warmup is sufficient in typical training scenarios.

Findings

01

Untuned warmup performs similarly to RAdam in practice.

02

The magnitude of the update term explains the need for warmup better than variance rectification does, and it is more relevant to training stability.

03

A simple linear warmup over 2/(1 - β2) iterations is recommended.
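The recommended schedule can be sketched in a few lines. This is a minimal sketch, assuming the warmup factor at (1-indexed) step t is min(1, (1 - β2)/2 · t), which reaches 1 after 2/(1 - β2) iterations. The function names are hypothetical and not taken from the paper or any library.

```python
def untuned_linear_warmup(step: int, beta2: float = 0.999) -> float:
    """Warmup factor for a 1-indexed step: min(1, (1 - beta2) / 2 * step).

    Hypothetical helper; the factor rises linearly and is capped at 1.
    """
    return min(1.0, 0.5 * (1.0 - beta2) * step)


def warmup_period(beta2: float = 0.999) -> int:
    """Number of iterations until the factor reaches 1, i.e. 2 / (1 - beta2)."""
    return round(2.0 / (1.0 - beta2))


if __name__ == "__main__":
    base_lr = 1e-3  # assumed base learning rate, for illustration only
    for step in (1, 500, 1000, 2000, 5000):
        lr = base_lr * untuned_linear_warmup(step)
        print(f"step {step:>5}: lr = {lr:.6f}")
```

With Adam's common default of β2 = 0.999, the warmup period works out to 2000 iterations. The effective learning rate at each step is the base learning rate multiplied by the factor, so no warmup length needs to be tuned separately.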

Abstract

Adaptive optimization algorithms such as Adam are widely used in deep learning. The stability of such algorithms is often improved with a warmup schedule for the learning rate. Motivated by the difficulty of choosing and tuning warmup schedules, recent work proposes automatic variance rectification of Adam's adaptive learning rate, claiming that this rectified approach ("RAdam") surpasses the vanilla Adam algorithm and reduces the need for expensive tuning of Adam with warmup. In this work, we refute this analysis and provide an alternative explanation for the necessity of warmup based on the magnitude of the update term, which is of greater relevance to training stability. We then provide some "rule-of-thumb" warmup schedules, and we demonstrate that simple untuned warmup of Adam performs more-or-less identically to RAdam in typical practical settings. We conclude by suggesting that…
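To make the "magnitude of the update term" concrete, the sketch below computes the very first bias-corrected Adam update for a single scalar gradient. It is offered as an illustration of a standard property of Adam, not as a reproduction of the paper's analysis: at step 1 the update is approximately sign(g), so its magnitude is about 1 whatever the gradient scale. The function name and default hyperparameters are assumptions.

```python
import math


def adam_first_update(grad: float, beta1: float = 0.9,
                      beta2: float = 0.999, eps: float = 1e-8) -> float:
    """Bias-corrected Adam update term at step 1 (before the learning rate).

    Hypothetical helper for illustration; moments start at zero.
    """
    m = (1.0 - beta1) * grad          # first-moment estimate after one step
    v = (1.0 - beta2) * grad * grad   # second-moment estimate after one step
    m_hat = m / (1.0 - beta1)         # bias correction with t = 1
    v_hat = v / (1.0 - beta2)
    return m_hat / (math.sqrt(v_hat) + eps)


if __name__ == "__main__":
    for g in (1e-3, 0.5, -2.0, 100.0):
        print(f"grad = {g:>8}: update = {adam_first_update(g):+.5f}")
```

Because the first update has roughly unit magnitude per coordinate even for tiny gradients, the learning rate alone sets the initial step size, which is one way to see why scaling it down early can help stability.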

Code & Models

Repositories

Tony-Y/pytorch_warmup (PyTorch)

Videos

On the Adequacy of Untuned Warmup for Adaptive Optimization

Taxonomy

Topics: Advanced Bandit Algorithms Research · Stochastic Gradient Optimization Techniques · Machine Learning and Algorithms

Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Average Pooling · Label Smoothing · Dropout · Byte Pair Encoding · Dense Connections · Softmax · Multi-Head Attention