AdaGC: Improving Training Stability for Large Language Model Pretraining

Guoxia Wang; Shuai Li; Congliang Chen; Jinle Zeng; Jiabin Yang; Dianhai Yu; Yanjun Ma; Li Shen

arXiv:2502.11034·cs.LG·February 24, 2026

Open Access · 3 Reviews

TL;DR

This paper introduces AdaGC, an adaptive gradient clipping method that stabilizes large language model pretraining by effectively mitigating loss spikes caused by various heterogeneous factors, leading to improved training stability and accuracy.

Contribution

AdaGC is an optimizer-agnostic, per-tensor gradient clipping scheme that reduces training instability and improves downstream performance in large-scale language model pretraining.

Findings

01. AdaGC eliminates training spikes across multiple models.

02. AdaGC improves downstream accuracy by over 1% in tested models.

03. AdaGC reduces communication costs in distributed training.

Abstract

Loss spikes remain a persistent obstacle in large-scale language model pretraining. While previous research has attempted to identify the root cause of loss spikes by investigating individual factors, we observe that, in practice, such spikes are typically triggered by the confluence of heterogeneous factors. Empirically, loss spikes may arise from a combination of data outliers, hardware or transient computational faults, numerical precision issues, and hyperparameter settings. Regardless of the underlying cause, these spikes manifest as unstable optimizer updates, as abnormal gradients contaminate both first- and second-moment states. In this paper, we propose a principled gradient-centric remedy: AdaGC, an adaptive per-tensor gradient clipping scheme that mitigates such contamination by bounding gradient norms relative to a tensor-wise exponential moving average of their historical…
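The mechanism the abstract describes, clipping each tensor's gradient relative to an exponential moving average (EMA) of that tensor's past gradient norms, can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function name `adaptive_clip_step`, the defaults `rel_threshold=1.05` and `beta=0.98`, seeding the EMA from the first observed norm, and updating the EMA with the clipped norm are all assumptions made for the example.

```python
import math


def adaptive_clip_step(grads, ema_norms, rel_threshold=1.05, beta=0.98):
    """Clip each tensor's gradient against its own EMA of past norms.

    grads:     dict name -> list of floats (a flattened gradient tensor)
    ema_norms: dict name -> float, one scalar of state per tensor
               (mutated in place)
    rel_threshold, beta: hypothetical hyperparameters for this sketch.
    """
    clipped = {}
    for name, g in grads.items():
        norm = math.sqrt(sum(x * x for x in g))
        ema = ema_norms.get(name)
        if ema is None:
            # Assumption: with no history, accept the gradient unchanged
            # and seed the EMA with its norm.
            clipped[name] = list(g)
            ema_norms[name] = norm
            continue
        limit = rel_threshold * ema
        # norm > limit >= 0 implies norm > 0, so the division is safe.
        scale = limit / norm if norm > limit else 1.0
        clipped[name] = [x * scale for x in g]
        # Assumption: the EMA tracks the clipped norm, so a single
        # abnormal gradient cannot inflate later thresholds.
        ema_norms[name] = beta * ema + (1.0 - beta) * min(norm, limit)
    return clipped


# Usage: 50 steady steps with norm 5.0, then one abnormal gradient.
state = {}
for _ in range(50):
    adaptive_clip_step({"w": [3.0, 4.0]}, state)
spiked = adaptive_clip_step({"w": [300.0, 400.0]}, state)
# The spiked gradient keeps its direction, but its norm is bounded by
# rel_threshold times the EMA held before the spike.
```

Because the clipped gradient is what would be handed to the optimizer, a sketch like this sits in front of any optimizer's update rule, which is one way to read the "optimizer-agnostic" claim. The only extra state is one scalar per tensor.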

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 4 · Confidence 3

Strengths

The paper presents strong empirical evidence that the proposed AdaGC method effectively mitigates training loss spikes caused by abnormal gradients. The experimental evaluation is extensive, covering multiple architectures and modalities, including dense models (Llama-2), Mixture-of-Experts models (Mixtral and ERNIE), and a vision-language model (CLIP). Both pretraining and downstream performance are assessed, demonstrating that AdaGC consistently stabilizes training and slightly improves accura

Weaknesses

The paper provides solid analysis, but some explanations and comparisons are not entirely clear. The main points are listed here, with more details in the questions section.
- **W1:** Section 5.4 **Optimizer Compatibility: Muon and Lion** feels weak in its current form. Since the experiments do not demonstrate that spikes occur, it is unclear how this section supports the main goal of the paper, which is to eliminate loss spikes.
- **W2:** The **ablation study about adaptivity and locality**

Reviewer 02 · Rating 6 · Confidence 4

Strengths

1. Identifies that “abnormal gradients polluting optimizer states” is the common final path to loss spikes, giving a clean, optimizer-agnostic intervention point.
2. Method simplicity: Per-tensor EMA of gradient norms plus relative clipping; implementation needs ≈4 bytes/tensor and <10 lines of code change, yet completely suppresses spikes on 1.3B–10B models.
3. Empirical coverage: Extensive experiments on dense (Llama-2) and MoE (Mixtral, ERNIE) architectures; consistent zero spike scores a

Weaknesses

1. The reliance on the exponential moving average (EMA) of per-tensor gradient norms introduces an inherent lag in adapting to sudden gradient spikes. Since the clipping threshold is updated based on historical statistics, AdaGC may fail to respond promptly to abrupt increases in gradient magnitudes. Consequently, outlier gradients could still enter the optimizer state before the EMA sufficiently adjusts, potentially undermining the intended stabilizing effect.
2. While the authors claim that A

Reviewer 03 · Rating 4 · Confidence 3

Strengths

* This paper addresses a widespread and costly issue, loss spikes in LLM training, using a simple, optimizer-agnostic remedy.
* The proposed method is evaluated on multiple model types and optimizers, showing consistent improvements and detailed ablation analysis.

Weaknesses

* The paper treats gradient spikes as a black-box phenomenon, lacking diagnostic analysis (e.g., layer- or token-level gradient behavior) to substantiate the “gradient contamination” hypothesis.
* The total token counts (e.g., 36B tokens for LLaMA-2 7B) are small compared to real large-scale pretraining, and many comparisons are limited to early training stages, which raises doubts about whether AdaGC’s stability holds in real-world, trillion-token-scale training.
* The reported gains of AdaGC wh

Taxonomy

Topics: Intelligent Tutoring Systems and Adaptive Learning · Topic Modeling · Natural Language Processing Techniques

Methods: Contrastive Language-Image Pre-training · Evolved Sign Momentum · Gradient Clipping · AdamW