AdaGC: Improving Training Stability for Large Language Model Pretraining
Guoxia Wang, Shuai Li, Congliang Chen, Jinle Zeng, Jiabin Yang, Dianhai Yu, Yanjun Ma, Li Shen

TL;DR
This paper introduces AdaGC, an adaptive gradient clipping method that stabilizes large language model pretraining by effectively mitigating loss spikes caused by various heterogeneous factors, leading to improved training stability and accuracy.
Contribution
AdaGC is an optimizer-agnostic, per-tensor gradient clipping scheme that reduces training instability and improves downstream performance in large-scale language model pretraining.
Findings
AdaGC eliminates loss spikes across multiple models.
AdaGC improves downstream accuracy by over 1% in tested models.
AdaGC reduces communication costs in distributed training.
Abstract
Loss spikes remain a persistent obstacle in large-scale language model pretraining. While previous research has attempted to identify the root cause of loss spikes by investigating individual factors, we observe that, in practice, such spikes are typically triggered by the confluence of heterogeneous factors. Empirically, loss spikes may arise from a combination of data outliers, hardware or transient computational faults, numerical precision issues, and hyperparameter settings. Regardless of the underlying cause, these spikes manifest as unstable optimizer updates, as abnormal gradients contaminate both first- and second-moment states. In this paper, we propose a principled gradient-centric remedy: AdaGC, an adaptive per-tensor gradient clipping scheme that mitigates such contamination by bounding gradient norms relative to a tensor-wise exponential moving average of their historical…
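The clipping idea described in the abstract can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names, the `rel_threshold` and `beta` values, the choice to seed the average with the first observed norm, and the choice to feed the clipped norm back into the average are all assumptions made here. Gradients are plain Python lists standing in for parameter tensors.

```python
import math


def tensor_norm(values):
    """L2 norm of a flat list of floats (stand-in for one tensor's gradient)."""
    return math.sqrt(sum(v * v for v in values))


def adaptive_clip(grads, ema_norms, rel_threshold=1.05, beta=0.98, eps=1e-12):
    """Clip each tensor's gradient against its own running average of norms.

    grads:      dict name -> list of floats, one entry per parameter tensor
    ema_norms:  dict name -> float, per-tensor EMA state, updated in place
    rel_threshold, beta: illustrative values (assumptions, not from the paper)
    """
    clipped = {}
    for name, g in grads.items():
        norm = tensor_norm(g)
        if name not in ema_norms:
            # Assumption: with no history yet, accept the gradient unchanged
            # and use its norm to seed the running average.
            ema_norms[name] = norm
            clipped[name] = list(g)
            continue
        limit = rel_threshold * ema_norms[name]
        scale = min(1.0, limit / (norm + eps))
        clipped[name] = [v * scale for v in g]
        # Assumption: track the clipped norm, so one outlier cannot inflate
        # the history that future thresholds are computed from.
        ema_norms[name] = beta * ema_norms[name] + (1.0 - beta) * min(norm, limit)
    return clipped


# Usage: the second step multiplies the gradient of "w" by 100; "b" is unchanged.
ema = {}
adaptive_clip({"w": [3.0, 4.0], "b": [1.0]}, ema)
out = adaptive_clip({"w": [300.0, 400.0], "b": [1.0]}, ema)
print(tensor_norm(out["w"]), out["b"], ema["w"])
```

Because each tensor carries its own scalar state, a spike in one tensor is rescaled without shrinking the gradients of the others, which is the point of clipping per tensor rather than globally. A real training loop would also need a policy for the first steps (for example, a tensor whose first gradient is zero would stay clipped to zero under this sketch).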
Peer Reviews
Decision: Submitted to ICLR 2026
The paper presents strong empirical evidence that the proposed AdaGC method effectively mitigates training loss spikes caused by abnormal gradients. The experimental evaluation is extensive, covering multiple architectures and modalities, including dense models (Llama-2), Mixture-of-Experts models (Mixtral and ERNIE), and a vision-language model (CLIP). Both pretraining and downstream performance are assessed, demonstrating that AdaGC consistently stabilizes training and slightly improves accura
The paper provides solid analysis, but some explanations and comparisons are not entirely clear. The main points are listed here, with more details in the questions section.
- **W1:** Section 5.4 **Optimizer Compatibility: Muon and Lion** feels weak in its current form. Since the experiments do not demonstrate that spikes occur, it is unclear how this section supports the main goal of the paper, which is to eliminate loss spikes.
- **W2:** The **ablation study about adaptivity and locality**
1. Identifies that “abnormal gradients polluting optimizer states” is the common final path to loss spikes, giving a clean, optimizer-agnostic intervention point.
2. Method simplicity: Per-tensor EMA of gradient norms plus relative clipping; implementation needs ≈4 bytes/tensor and <10 lines of code change, yet completely suppresses spikes on 1.3B–10B models.
3. Empirical coverage: Extensive experiments on dense (Llama-2) and MoE (Mixtral, ERNIE) architectures; consistent zero spike scores a
1. The reliance on the exponential moving average (EMA) of per-tensor gradient norms introduces an inherent lag in adapting to sudden gradient spikes. Since the clipping threshold is updated based on historical statistics, AdaGC may fail to respond promptly to abrupt increases in gradient magnitudes. Consequently, outlier gradients could still enter the optimizer state before the EMA sufficiently adjusts, potentially undermining the intended stabilizing effect.
2. While the authors claim that A
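One way to make the lag concern in point 1 concrete is to bound how fast an EMA-based threshold can move. The derivation below is a sketch under an assumed update rule, not the paper's stated algorithm: $e_t$ is one tensor's EMA of clipped gradient norms, $\beta$ is the EMA decay, and $\lambda > 1$ is the relative clipping factor. All three symbols are introduced here for illustration.

```latex
\begin{aligned}
\tilde{n}_t &= \min\bigl(\lVert g_t \rVert,\ \lambda\, e_{t-1}\bigr)
  && \text{(assumed: gradient norm after clipping)} \\
e_t &= \beta\, e_{t-1} + (1-\beta)\, \tilde{n}_t
  \;\le\; \bigl(\beta + (1-\beta)\lambda\bigr)\, e_{t-1}
  && \text{(largest possible one-step growth)} \\
e_{t+n} &\le \bigl(\beta + (1-\beta)\lambda\bigr)^{n}\, e_t
  \;\Longrightarrow\;
  n \;\ge\; \frac{\ln k}{\ln\bigl(\beta + (1-\beta)\lambda\bigr)}
  && \text{(steps needed for the EMA to grow $k$-fold)}
\end{aligned}
```

With the illustrative values $\beta = 0.98$ and $\lambda = 1.05$, the per-step growth factor is $1.001$, so the threshold needs roughly 700 steps to double ($\ln 2 / \ln 1.001 \approx 693$). If the update rule has this form, the norm that reaches the optimizer never exceeds $\lambda\, e_{t-1}$, and the lag would appear mainly as slow adaptation to a sustained, legitimate rise in gradient scale. Whether the paper's actual rule behaves this way cannot be determined from the excerpt above.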
* This paper addresses a widespread and costly issue, loss spikes in LLM training, using a simple, optimizer-agnostic remedy.
* The proposed method is evaluated on multiple model types and optimizers, showing consistent improvements and detailed ablation analysis.
* The paper treats gradient spikes as a black-box phenomenon, lacking diagnostic analysis (e.g., layer- or token-level gradient behavior) to substantiate the “gradient contamination” hypothesis.
* The total token counts (e.g., 36B tokens for LLaMA-2 7B) are small compared to real large-scale pretraining, and many comparisons are limited to early training stages, which raises doubts about whether AdaGC’s stability holds in real-world, trillion-token-scale training.
* The reported gains of AdaGC wh
Taxonomy
Topics: Intelligent Tutoring Systems and Adaptive Learning · Topic Modeling · Natural Language Processing Techniques
Methods: Contrastive Language-Image Pre-training · Evolved Sign Momentum · Gradient Clipping · AdamW
