Adaptive Preconditioners Trigger Loss Spikes in Adam
Zhiwei Bai, Zhangchen Zhou, Jiajie Zhao, Xiaolong Li, Zhiyu Li, Feiyu Xiong, Hongkang Yang, Yaoyu Zhang, Zhi-Qin John Xu

TL;DR
This paper shows that Adam's adaptive preconditioners can themselves cause loss spikes: when squared gradients fall well below the second-order moment estimates, the preconditioners respond sluggishly to current gradient information, which destabilizes training and produces sudden increases in loss.
Contribution
The study reveals a novel mechanism where Adam's adaptive preconditioners trigger loss spikes, challenging previous explanations based on loss landscape sharpness.
Findings
Identified a critical regime causing Adam spikes due to slow preconditioner response.
Demonstrated the mechanism across various neural network architectures.
Verified the instability mechanism through extensive experiments.
Abstract
Loss spikes emerge commonly during training across neural networks of varying architectures and scales when using the Adam optimizer. In this work, we investigate the underlying mechanism responsible for Adam spikes. While previous explanations attribute these phenomena to the lower-loss-as-sharper characteristics of the loss landscape, our analysis reveals that Adam's adaptive preconditioners themselves can trigger spikes. Specifically, we identify a critical regime where squared gradients become substantially smaller than the second-order moment estimates, causing the latter to undergo a $\beta_2$-exponential decay and to respond sluggishly to current gradient information. This mechanism can push the maximum eigenvalue of the preconditioned Hessian beyond the classical stability threshold for a sustained period, inducing instability. This instability further leads to an…
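The regime described in the abstract can be illustrated with a toy calculation. The sketch below is not the authors' code: it iterates Adam's standard second-moment update $v_t = \beta_2 v_{t-1} + (1-\beta_2) g_t^2$ with a constant squared gradient far smaller than $v_0$, and then checks when a scalar "preconditioned curvature" $\eta h / (\sqrt{v_t} + \epsilon)$ exceeds 2. The function names, the one-dimensional setting, and all numeric values are assumptions chosen for illustration; momentum and bias correction are ignored.

```python
import math

def second_moment_trace(v0, grad_sq, beta2, steps):
    """Iterate v_t = beta2 * v_{t-1} + (1 - beta2) * g^2 with a constant g^2."""
    trace = [v0]
    v = v0
    for _ in range(steps):
        v = beta2 * v + (1.0 - beta2) * grad_sq
        trace.append(v)
    return trace

def first_unstable_step(trace, lr, curvature, eps=1e-8):
    """Return the first step where lr * h / (sqrt(v_t) + eps) exceeds 2.

    This is a 1D stand-in (an assumption of this sketch) for the maximum
    eigenvalue of the preconditioned Hessian crossing 2 / lr.
    """
    for t, v in enumerate(trace):
        if lr * curvature / (math.sqrt(v) + eps) > 2.0:
            return t
    return None

beta2 = 0.999
# Hypothetical values: g^2 = 1e-14 is many orders of magnitude below v0 = 1e-6.
trace = second_moment_trace(v0=1e-6, grad_sq=1e-14, beta2=beta2, steps=3000)

# While g^2 << v, each step multiplies v by almost exactly beta2.
decay_ratio = trace[1000] / trace[999]

# Step at which the scalar preconditioned curvature first exceeds 2.
crossing = first_unstable_step(trace, lr=1e-3, curvature=1.0)

print(decay_ratio, crossing)
```

In this toy setting the estimate $v_t$ shrinks by a factor of roughly $\beta_2$ per step regardless of the current gradient, so the effective step size grows geometrically until the stability condition is violated, even though the curvature $h$ never changes.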
Peer Reviews
Decision · Submitted to ICLR 2026
This work addresses a significant and longstanding issue in deep learning: the phenomenon of loss spikes during Adam optimization. This problem frequently troubles practitioners but has, until now, been poorly understood. The paper's focus on providing a mechanistic explanation for these instabilities is therefore timely and valuable.
- Lack of Theoretical Novelty and Applicability: The theoretical analysis presented in the paper lacks novelty. The core condition for stability, $\eta \le \frac{2}{\lambda_{\max} (H_t)}$, is a well-known and fundamental result in optimization theory. While the authors attempt to extend this by using a time-varying $\lambda_{\max} (H_t)$, their analysis still relies on the assumption that the Hessian $H_t$ remains constant and positive-definite. This severely limits the applicability of their th
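The classical condition $\eta \le \frac{2}{\lambda_{\max}}$ mentioned in this review can be checked numerically on a one-dimensional quadratic. The following is a minimal sketch under assumed values (curvature 10, so the threshold is 0.2); the function name and parameters are hypothetical and it uses plain gradient descent, not Adam.

```python
def gd_on_quadratic(lam, lr, x0=1.0, steps=50):
    """Run gradient descent on f(x) = 0.5 * lam * x**2 and return the final x."""
    x = x0
    for _ in range(steps):
        x = x - lr * lam * x  # gradient of f is lam * x
    return x

lam = 10.0                 # curvature; the stability threshold is 2 / lam = 0.2
stable = gd_on_quadratic(lam, lr=0.19)    # just below the threshold
unstable = gd_on_quadratic(lam, lr=0.21)  # just above the threshold

print(abs(stable), abs(unstable))
```

Each step multiplies the iterate by $1 - \eta\lambda$, so the iterates contract when $|1 - \eta\lambda| < 1$ and grow in magnitude once $\eta$ exceeds $2/\lambda$.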
1. The paper presents a clear and convincing mechanistic model explaining how Adam’s adaptive preconditioner leads to spike formation, supported by precise mathematical characterization, direct derivations, and step-by-step proofs. The proposed five-stage framework provides an intuitive yet theoretically grounded view of the phenomenon. 2. Extensive experiments across diverse neural architectures, such as quadratic functions, small MLPs, CNNs, and transformer models, demonstrate the ubiquity an
1. The paper lacks experiments comparing Adam with other optimizers or its common variants, such as RMSProp, AdaGrad, or AdamW. Including such baselines would help clarify whether the observed spike dynamics are unique to Adam or shared across adaptive methods. 2. The large-scale experiments rely on synthetic data or controlled training conditions. As a result, the generality of the conclusions for diverse, real-world datasets remains to be convincingly demonstrated. 3. The use of gradient-dire
The paper conducts thorough experiments on an interesting and important topic (loss spikes in Adam). The suggestion that cutting $\beta_2$ can ameliorate loss spikes may be useful for practitioners.
One weakness of the paper is its novelty with respect to prior work, notably https://arxiv.org/abs/2207.14484 (which was cited). That paper previously conducted the stability analysis of Adam that is given here as Proposition 2. Nevertheless, that paper did not emphasize how gradient norm shrinkage leads to sharpening of the preconditioned Hessian, nor did it have Theorem 1, nor did it discuss how some instabilities lead to spikes in the loss whereas others do not.
Taxonomy
Topics: Neural Networks and Reservoir Computing · Advanced Memory and Neural Computing · Stochastic Gradient Optimization Techniques
Methods: Absolute Position Encodings · Layer Normalization · Byte Pair Encoding · Label Smoothing · Softmax · Dropout · Dense Connections · Transformer · Adam
