Analyzing & Reducing the Need for Learning Rate Warmup in GPT Training

Atli Kosson; Bettina Messmer; Martin Jaggi

arXiv:2410.23922·cs.LG·November 1, 2024

TL;DR

This paper investigates why learning rate warmup benefits GPT training and proposes methods to reduce or eliminate warmup by normalizing optimizer updates, leading to more efficient training.

Contribution

The study provides new insights into warmup's role in controlling update sizes and introduces normalization techniques to lessen warmup dependency in GPT training.

Findings

01

Warmup counters large angular updates early in training.

02

Limited critical batch size is a key factor in warmup necessity.

03

Optimizer normalization can significantly reduce or remove warmup requirements.
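As a rough illustration of the third finding, the sketch below rescales a raw optimizer update so that its size relative to the weight norm is fixed. This is only one simple way to normalize updates and is not taken from the paper; the helper `normalize_update` and the `target_ratio` parameter are hypothetical names introduced here.

```python
import math


def l2_norm(v):
    """Euclidean norm of a flat list of floats."""
    return math.sqrt(sum(x * x for x in v))


def normalize_update(w, u, target_ratio, eps=1e-12):
    """Rescale u so that ||delta|| / ||w|| is approximately target_ratio.

    Hypothetical helper for illustration only; the paper's own
    normalization schemes may differ.
    """
    scale = target_ratio * l2_norm(w) / (l2_norm(u) + eps)
    return [scale * x for x in u]


w = [3.0, 4.0]            # weight vector with norm 5
early_u = [30.0, -40.0]   # assumed large raw update early in training
late_u = [0.3, -0.4]      # assumed small raw update later in training

for name, u in [("early", early_u), ("late", late_u)]:
    delta = normalize_update(w, u, target_ratio=0.01)
    print(name, l2_norm(delta) / l2_norm(w))
```

Under this assumption the relative update size is the same whether the raw update is large or small, so the learning rate schedule no longer has to compensate for large early updates.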

Abstract

Learning Rate Warmup is a popular heuristic for training neural networks, especially at larger batch sizes, despite limited understanding of its benefits. Warmup decreases the update size $\Delta w_{t} = \eta_{t} u_{t}$ early in training by using lower values for the learning rate $\eta_{t}$. In this work we argue that warmup benefits training by keeping the overall size of $\Delta w_{t}$ limited, counteracting large initial values of $u_{t}$. Focusing on small-scale GPT training with AdamW/Lion, we explore the following question: Why and by which criteria are early updates $u_{t}$ too large? We analyze different metrics for the update size including the $\ell_{2}$-norm, resulting directional change, and impact on the representations of the network, providing a new perspective on warmup. In particular, we find that warmup helps counteract large angular…
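To make two of the update-size metrics named in the abstract concrete, here is a minimal sketch that computes the $\ell_{2}$-norm of $\Delta w_{t} = \eta_{t} u_{t}$ and the resulting angular change of the weight vector, with and without a linear warmup. The helper names, the toy vectors, and the convention $w_{t+1} = w_{t} + \Delta w_{t}$ (with $u_{t}$ already carrying the descent sign) are assumptions made for illustration, not details from the paper.

```python
import math


def l2_norm(v):
    """Euclidean norm of a flat list of floats."""
    return math.sqrt(sum(x * x for x in v))


def angle_between(a, b):
    """Angle in radians between two vectors (cosine clamped for safety)."""
    dot = sum(x * y for x, y in zip(a, b))
    cos = dot / (l2_norm(a) * l2_norm(b))
    return math.acos(max(-1.0, min(1.0, cos)))


def linear_warmup_lr(step, peak_lr, warmup_steps):
    """Learning rate that ramps linearly up to peak_lr (hypothetical helper)."""
    if warmup_steps <= 0:
        return peak_lr
    return peak_lr * min(1.0, (step + 1) / warmup_steps)


def apply_update(w, u, lr):
    """Return (w + lr * u, lr * u); assumes u already points downhill."""
    delta = [lr * x for x in u]
    w_new = [wi + di for wi, di in zip(w, delta)]
    return w_new, delta


w0 = [1.0, 0.0]   # toy weight vector
u0 = [0.0, 1.0]   # toy first update, as large as the weights themselves

schedules = [
    ("no warmup", 1.0),
    ("warmup", linear_warmup_lr(step=0, peak_lr=1.0, warmup_steps=10)),
]
for name, lr in schedules:
    w1, delta = apply_update(w0, u0, lr)
    print(name, "norm:", l2_norm(delta), "angle:", angle_between(w0, w1))
```

In this toy setting the lower initial learning rate shrinks both the norm of the first update and the rotation it causes in the weight vector.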

Taxonomy

Topics: Intelligent Tutoring Systems and Adaptive Learning

Methods: Attention Is All You Need · Linear Layer · Cosine Annealing · Linear Warmup With Cosine Annealing · Adam · Attention Dropout · Multi-Head Attention · Softmax · Weight Decay