Spike No More: Stabilizing the Pre-training of Large Language Models

Sho Takase; Shun Kiyono; Sosuke Kobayashi; Jun Suzuki

arXiv:2312.16903·cs.CL·July 28, 2025·2 cites

TL;DR

This paper investigates the causes of loss spikes during large language model pre-training and proposes conditions to stabilize training by controlling gradient norms, validated through theoretical analysis and experiments.

Contribution

It identifies key factors—small sub-layers and large shortcuts—that prevent loss spikes, offering practical guidelines for stable pre-training.

Findings

01

Stabilizing pre-training requires small spectral norms in the sub-layers.

02

Large shortcut connections help prevent loss spikes.

03

Methods satisfying these conditions effectively stabilize training.

Abstract

Loss spikes often occur during pre-training of large language models. The spikes degrade the performance of large language models and sometimes ruin the pre-training. Since the pre-training needs a vast computational budget, we should avoid such spikes. Based on the assumption that the loss spike is caused by the sudden growth of the gradient norm, we explore factors to keep the gradient norm small through an analysis of the spectral norms of the Jacobian matrices for the sub-layers. Our findings suggest that stabilizing the pre-training process requires two conditions: small sub-layers and large shortcut. We conduct various experiments to empirically verify our theoretical analyses. Experimental results demonstrate that methods satisfying the conditions effectively prevent loss spikes during pre-training.
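The two conditions in the abstract can be illustrated numerically. The following is a minimal sketch, not the authors' analysis or code: it assumes a toy pre-LN residual block `y = x + W @ LN(x)`, where a single linear map `W` stands in for a full sub-layer, and it compares the spectral norm of the block's Jacobian under three settings. The dimension `d`, the standard deviation 0.02, the 0.1 shrink factor, and the `sqrt(d)` input scaling are all illustrative assumptions, as are the function names.

```python
import numpy as np

def layer_norm_jacobian(x, eps=1e-6):
    """Jacobian of LN(x) = (x - mean) / sqrt(var + eps), without gain or bias."""
    d = x.shape[0]
    s = np.sqrt(x.var() + eps)
    z = (x - x.mean()) / s
    # dLN/dx = (I - 11^T / d - z z^T / d) / s
    return (np.eye(d) - np.ones((d, d)) / d - np.outer(z, z) / d) / s

def residual_jacobian_norm(x, W):
    """Spectral norm of d(x + W @ LN(x)) / dx for the toy block."""
    J = np.eye(x.shape[0]) + W @ layer_norm_jacobian(x)
    return np.linalg.norm(J, 2)  # largest singular value

rng = np.random.default_rng(0)
d = 64                                # hypothetical hidden size
x = rng.normal(0.0, 0.02, d)          # small-scale shortcut input (assumed)
W = rng.normal(0.0, 0.02, (d, d))     # toy sub-layer weights (assumed)

baseline = residual_jacobian_norm(x, W)
small_sublayer = residual_jacobian_norm(x, 0.1 * W)          # shrink the sub-layer
large_shortcut = residual_jacobian_norm(np.sqrt(d) * x, W)   # enlarge the shortcut

print(f"baseline:        {baseline:.2f}")
print(f"small sub-layer: {small_sublayer:.2f}")
print(f"large shortcut:  {large_shortcut:.2f}")
```

In this toy setting, the Jacobian of layer normalization scales inversely with the standard deviation of its input, so either shrinking `W` or enlarging `x` reduces the sub-layer term relative to the identity term contributed by the shortcut. This mirrors, in simplified form, the intuition behind "small sub-layers and large shortcut".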

Peer Reviews

Decision·Submitted to ICLR 2025

Reviewer 01 · Rating 5 · Confidence 3

Strengths

* The findings (i) and (ii) from the analysis are well presented, although they have been previously utilized in past studies.
* This work examines various learning hyperparameters.
* This work also presents the results for the 13B model in Table 6 of the Appendix.
* The paper is well-written and easy to understand.

Weaknesses

* Although the theoretical analysis is intriguing, I question the practical value of this work, as most practices described in Section 4 are already in use. Utilizing small values for initialization to ensure stable training is well-known, and both Scaled Embed and Embed LN have been introduced in prior literature. If this work could offer a novel, advanced method for embedding normalization, it might receive more interest from the community.
* The activation function F was assumed to be either

Reviewer 02 · Rating 3 · Confidence 3

Strengths

The paper conducts a mathematical analysis to derive the requisite terms that the authors later leverage. The paper is clear and provides actionable results.

Weaknesses

The authors tested only smaller models, although it is well established that most instability problems happen with larger models (>100B parameters). It would be beneficial to evaluate the loss curves on larger models or more diverse datasets. Although the focus of this paper was to stabilize training, they underperform on loss curves compared to vanilla approaches to disprove Le Scao et al.'s findings. This is hypothesized to be related to learning rates, which is demonstrated by looking at a absolut

Reviewer 03 · Rating 5 · Confidence 4

Strengths

The paper effectively addresses the critical issue of loss spikes during training, providing a detailed analysis of their relationship with gradient norms and embedding means. It evaluates existing approaches like "Scaled Embed" and "Embed LN," discussing their effectiveness in mitigating spikes. Additionally, the paper offers valuable insights into the impact of learning rate adjustments on model stability and compares the behavior of spikes in large language models (LLMs) with smaller models,

Weaknesses

The paper suffers from an unclear relationship between spikes and poor performance, with insufficient explanations of key terms and assumptions. The evaluation section is not well-explained, and there are inconsistencies in terminology. Additionally, it lacks necessary plots and data to support its assumptions, and some figures are difficult to interpret due to overlapping lines. Reproducibility is a concern as common public architectures are not used, and some discussions are considered irrelev

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Speech Recognition and Synthesis