Mpemba Effect in Large-Language Model Training Dynamics: A Minimal Analysis of the Valley-River model

Sibei Liu; Zhijian Hu

arXiv:2507.04206 · cs.AI · July 8, 2025

TL;DR

This paper introduces a thermodynamic analogy using the Mpemba effect to explain and optimize learning rate schedules in large language model training, providing a theoretical foundation for plateau-based strategies.

Contribution

It connects training dynamics to the Mpemba effect, deriving conditions for optimal plateau learning rates and offering a minimal analytical model for LR schedule tuning.

Findings

01 The Mpemba effect explains the necessity of warm-up phases in LR schedules.

02 An optimal plateau learning rate, the 'strong Mpemba point,' accelerates convergence.

03 Analytical conditions for the existence of the strong Mpemba point are derived.

Abstract

Learning rate (LR) schedules in large language model (LLM) training often follow empirical templates: warm-up, constant plateau/stable phase, and decay (WSD). However, the mechanistic explanation for this strategy remains underexplored, and the choice of plateau height and decay schedule is largely heuristic. In this paper, we connect training dynamics to a thermodynamic analogy via the Mpemba effect - a phenomenon in which a hotter system cools faster than a colder one when quenched into the same bath. We analyze a class of "valley-river" loss landscapes, where sharp (valley) directions equilibrate quickly, while flatter (river) directions govern global descent. The Mpemba effect provides an explanation for the necessity of the warm-up phase and motivates a high plateau - rather than a low one - for accelerating loss decrease during decay. We show that for certain loss landscapes,…
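
To make the warm-up, stable, decay (WSD) template concrete, here is a minimal sketch of such a schedule. The linear ramp shapes, the function name `wsd_lr`, and all parameter names are assumptions chosen for illustration; the abstract does not specify the ramp shapes, and the paper's own schedule may differ.

```python
def wsd_lr(step, total_steps, peak_lr, warmup_steps, decay_steps, final_lr=0.0):
    """Illustrative warm-up / stable / decay (WSD) learning-rate schedule.

    Assumption: both ramps are linear. `peak_lr` plays the role of the
    plateau height discussed in the abstract.
    """
    if decay_steps <= 0 or decay_steps > total_steps:
        raise ValueError("decay_steps must be in (0, total_steps]")

    # Warm-up: ramp linearly up to the plateau height.
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps

    # Stable phase: hold the plateau until the decay begins.
    decay_start = total_steps - decay_steps
    if step < decay_start:
        return peak_lr

    # Decay: interpolate linearly from the plateau down to final_lr.
    progress = min(1.0, (step - decay_start) / decay_steps)
    return peak_lr + (final_lr - peak_lr) * progress


if __name__ == "__main__":
    # Print the learning rate at a few steps of a 100-step run.
    for s in (0, 9, 50, 90, 100):
        print(s, wsd_lr(s, total_steps=100, peak_lr=1.0, warmup_steps=10, decay_steps=20))
```

In this sketch, the plateau height that the abstract calls largely heuristic corresponds to the single argument `peak_lr`, so comparing a high plateau against a low one amounts to varying that value while holding the decay phase fixed.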
