Why Gradients Rapidly Increase Near the End of Training
Aaron Defazio

TL;DR
This paper investigates why gradient norms rise sharply near the end of LLM training. It traces the cause to an interaction between weight decay, normalization layers, and the learning rate schedule, and proposes a simple correction that also lowers loss throughout training.
Contribution
The paper reveals the cause of gradient norm spikes in LLM training and introduces a simple correction to improve training stability and reduce loss.
Findings
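Gradient norms increase rapidly near the end of training because of an unintended interaction between weight decay, normalization layers, and the learning rate schedule.
The proposed correction removes this increase and yields lower loss values throughout training.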
The fix is simple and effective across different training setups.
Abstract
During long-duration Large Language Model (LLM) training runs, the gradient norm increases rapidly near the end of training. In this short note, we show that this increase is due to an unintended interaction between weight decay, normalization layers, and the learning rate schedule. We propose a simple correction that fixes this behavior while also resulting in lower loss values throughout training.
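The interaction described in the abstract can be illustrated with a toy model. The sketch below is not the paper's code. It assumes a single scale-invariant weight vector (as produced by a normalization layer) whose gradient is orthogonal to the weights with norm c / ||w||, plain SGD with weight decay, and a linear learning rate decay. The function name, parameter values, and the specific form of the correction (scaling weight decay in proportion to the current learning rate) are all assumptions made for illustration.

```python
import math

def simulate_grad_norms(steps=20000, lr_max=0.1, wd=0.1, c=1.0, corrected=False):
    """Toy model of one scale-invariant (normalized) weight vector.

    Assumptions (illustrative only, not taken from the paper's code):
      * the gradient is orthogonal to the weights, with norm c / ||w||;
      * plain SGD with weight decay: w <- w - lr * (g + wd * w);
      * the learning rate decays linearly from lr_max to zero;
      * the "correction" scales weight decay by lr / lr_max.
    """
    # Start near the steady state for the peak learning rate,
    # where ||g|| / ||w|| is approximately sqrt(2 * wd / lr).
    sq_norm = c * math.sqrt(lr_max / (2.0 * wd))  # this is ||w||^2
    grad_norms = []
    for t in range(steps):
        lr = lr_max * (1.0 - t / steps)
        wd_t = wd * lr / lr_max if corrected else wd
        grad_norm = c / math.sqrt(sq_norm)
        grad_norms.append(grad_norm)
        # ||(1 - lr*wd_t) * w - lr * g||^2 with g orthogonal to w.
        sq_norm = (1.0 - lr * wd_t) ** 2 * sq_norm + (lr * grad_norm) ** 2
    return grad_norms

fixed = simulate_grad_norms(corrected=False)
scaled = simulate_grad_norms(corrected=True)
print("fixed weight decay:  end/start grad norm =", fixed[-1] / fixed[0])
print("scaled weight decay: end/start grad norm =", scaled[-1] / scaled[0])
```

In this toy setting, the run with fixed weight decay ends with a noticeably larger gradient norm than it started with, because the weight norm shrinks as the learning rate decays. Under the assumed correction the ratio of weight decay to learning rate stays constant, so the weight norm and gradient norm stay roughly flat. Real LLM training with an adaptive optimizer will differ in detail.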
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Topic Modeling · Stochastic Gradient Optimization Techniques
