Why Gradients Rapidly Increase Near the End of Training

Aaron Defazio

arXiv:2506.02285·cs.LG·June 11, 2025

Open Access

TL;DR

This paper investigates why gradient norms increase rapidly near the end of long LLM training runs. It traces the cause to an interaction between weight decay, normalization layers, and the learning rate schedule, and proposes a simple correction that also lowers loss throughout training.

Contribution

The paper reveals the cause of gradient norm spikes in LLM training and introduces a simple correction to improve training stability and reduce loss.

Findings

01

Gradient norms increase rapidly near the end of training because of an unintended interaction between weight decay, normalization layers, and the learning rate schedule.

02

The proposed correction removes the late-training increase in gradient norm and yields lower loss values throughout training.

03

The fix is simple and effective across different training setups.

Abstract

During long-duration Large Language Model (LLM) training runs, the gradient norm increases rapidly near the end of training. In this short note, we show that this increase is due to an unintended interaction between weight decay, normalization layers, and the learning rate schedule. We propose a simple correction that fixes this behavior while also resulting in lower loss values throughout training.
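
The interaction described in the abstract can be reproduced in a toy setting. The sketch below is an illustration only, not the paper's experiment: it runs plain SGD with weight decay (not the optimizer used for LLM training) on a single weight vector that is assumed to feed a normalization layer. For such a weight the gradient is orthogonal to the weight and its norm scales as 1/||w||. Weight decay shrinks ||w|| while gradient steps grow it, so the balance point depends on the learning rate, and a schedule that decays to zero pulls ||w|| down and pushes the gradient norm up. The `scale_wd_with_lr` flag is a hypothetical correction, assumed here because this summary does not spell out the paper's fix: it scales the weight decay in proportion to the current learning rate so that the balance point stays put. All names and hyperparameters are made up for the example.

```python
import numpy as np


def simulate(steps=20000, dim=64, lr_max=0.1, wd=0.1,
             scale_wd_with_lr=False, seed=0):
    """Toy SGD run on one scale-invariant weight vector.

    Returns the gradient norm recorded at every step.
    """
    rng = np.random.default_rng(seed)
    w = rng.standard_normal(dim)
    w /= np.linalg.norm(w)
    grad_norms = np.empty(steps)
    for t in range(steps):
        lr = lr_max * (1.0 - t / steps)  # linear decay to zero
        # Hypothetical correction (an assumption, not from the summary):
        # shrink weight decay together with the learning rate.
        wd_t = wd * lr / lr_max if scale_wd_with_lr else wd
        # Random "raw" gradient direction with norm close to 1.
        u = rng.standard_normal(dim) / np.sqrt(dim)
        w_norm = np.linalg.norm(w)
        w_hat = w / w_norm
        # Gradient of a scale-invariant loss: orthogonal to w,
        # with magnitude proportional to 1 / ||w||.
        g = (u - np.dot(u, w_hat) * w_hat) / w_norm
        grad_norms[t] = np.linalg.norm(g)
        # SGD step with weight decay applied through the learning rate.
        w = w - lr * (g + wd_t * w)
    return grad_norms


def end_to_mid_ratio(grad_norms):
    """Mean gradient norm over the last 5% of steps divided by the
    mean over the 40%-50% window."""
    n = len(grad_norms)
    mid = grad_norms[int(0.4 * n):int(0.5 * n)].mean()
    end = grad_norms[int(0.95 * n):].mean()
    return end / mid


if __name__ == "__main__":
    baseline = simulate()
    corrected = simulate(scale_wd_with_lr=True)
    print("end/mid gradient norm, fixed weight decay: ",
          end_to_mid_ratio(baseline))
    print("end/mid gradient norm, scaled weight decay:",
          end_to_mid_ratio(corrected))
```

In this toy the ratio for the fixed weight decay run comes out well above 1, while the scaled variant stays close to 1. Whether the same holds for a real model and optimizer is what the paper itself examines.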

Taxonomy

Topics: Domain Adaptation and Few-Shot Learning · Topic Modeling · Stochastic Gradient Optimization Techniques