(How) Learning Rates Regulate Catastrophic Overtraining

Mark Rofin; Aditya Varre; Nicolas Flammarion

arXiv:2604.13627·cs.LG·April 16, 2026

TL;DR

This paper investigates how the learning rate influences catastrophic overtraining in large language models during supervised fine-tuning (SFT), showing that learning rate decay increases the sharpness of the pretrained model, which in turn worsens forgetting during SFT.

Contribution

The paper identifies the role of the learning rate in overtraining: it links optimization behavior to catastrophic forgetting and draws out implications for fine-tuning strategies.

Findings

01

For models trained to the same SFT loss, fine-tuning with large and small learning rates converges to qualitatively different models.

02

Learning rate decay increases model sharpness and exacerbates forgetting.

03

Understanding optimization dynamics helps mitigate overtraining in LLMs.

Abstract

Supervised fine-tuning (SFT) is a common first stage of LLM post-training, teaching the model to follow instructions and shaping its behavior as a helpful assistant. At the same time, SFT may harm the fundamental capabilities of an LLM, particularly after long pretraining: a phenomenon known as catastrophic overtraining (Springer et al., 2025). To understand overtraining, we first investigate catastrophic forgetting in finetuning through the lens of implicit regularization of the learning rate. For models trained to the same SFT loss, we identify how the learning rate mediates optimization: finetuning with large and small steps converges to qualitatively different models. Next, we link forgetting to overtraining: learning rate decay increases the sharpness of the pretrained model, which in turn exacerbates catastrophic forgetting during SFT, leading to overtraining. Our findings paint a…
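
The abstract does not say how sharpness is measured. One common way to make the notion concrete is the largest eigenvalue of the loss Hessian, and the sketch below assumes that definition; it is an illustration, not the authors' procedure. It estimates the top eigenvalue by power iteration on finite-difference Hessian-vector products, then checks the standard fact that gradient descent on a quadratic is stable only when the learning rate is below 2 divided by that eigenvalue. The toy loss and the names `hvp`, `sharpness`, and `run_gd` are all hypothetical.

```python
import numpy as np

def hvp(grad_fn, w, v, eps=1e-4):
    """Finite-difference Hessian-vector product around w (central difference)."""
    return (grad_fn(w + eps * v) - grad_fn(w - eps * v)) / (2 * eps)

def sharpness(grad_fn, w, iters=200, seed=0):
    """Estimate the largest Hessian eigenvalue at w by power iteration.

    Assumes the dominant eigenvalue (in magnitude) is positive, as it is
    for the convex toy loss below.
    """
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(w.shape)
    v /= np.linalg.norm(v)
    lam = 0.0
    for _ in range(iters):
        hv = hvp(grad_fn, w, v)
        lam = float(v @ hv)            # Rayleigh quotient for unit-norm v
        v = hv / np.linalg.norm(hv)
    return lam

# Toy quadratic loss 0.5 * w^T A w with curvatures 1 and 10 (an assumption
# for illustration; a real LLM loss is neither quadratic nor 2-dimensional).
A = np.diag([1.0, 10.0])
grad_fn = lambda w: A @ w

def run_gd(lr, steps=100):
    """Run plain gradient descent and return the final distance to the minimum."""
    w = np.array([1.0, 1.0])
    for _ in range(steps):
        w = w - lr * grad_fn(w)
    return float(np.linalg.norm(w))

lam_max = sharpness(grad_fn, np.array([1.0, 1.0]))
stable_norm = run_gd(lr=0.9 * 2 / lam_max)     # just below the 2 / lam_max threshold
unstable_norm = run_gd(lr=1.1 * 2 / lam_max)   # just above it: iterates blow up
print(lam_max, stable_norm, unstable_norm)
```

Under this definition, a sharper model tolerates only smaller stable steps, which gives some intuition for why the sharpness of the pretrained model and the SFT learning rate would interact. Whether the paper uses this exact measure is not stated in the text above.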
