Methods of improving LLM training stability

Oleg Rybakov; Mike Chrzanowski; Peter Dykas; Jinze Xue; Ben Lanir

arXiv:2410.16682·cs.CL·October 23, 2024

Open Access · 4 Reviews

TL;DR

This paper investigates training instabilities in large language models, focusing on the growth of linear layer outputs, and proposes normalization techniques to improve training stability and perplexity.

Contribution

It extends previous work by analyzing all linear layer outputs in Transformers and introduces normalization methods that enable higher learning rates and better model performance.

Findings

01. Applying layer normalization to additional layers increases stable learning rates.

02. Layer normalization after QK layers allows 1.5x higher learning rates without divergence.

03. Proposed methods significantly improve perplexity over baseline models.

Abstract

Training stability of large language models (LLMs) is an important research topic. Reproducing training instabilities can be costly, so we use a small language model with 830M parameters and experiment with higher learning rates to force models to diverge. One of the sources of training instability is the growth of logits in attention layers. We extend the focus of the previous work and look not only at the magnitude of the logits but at all outputs of linear layers in the Transformer block. We observe that with a high learning rate the L2 norm of all linear layer outputs can grow with each training step and the model diverges. Specifically, we observe that QKV, Proj and FC2 layers have the largest growth of the output magnitude. This prompts us to explore several options: 1) apply layer normalization not only after QK layers but also after Proj and FC2 layers; 2) apply layer…
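
To make the mechanism in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It assumes a single attention head, a layer normalization without learned gain or bias, and hypothetical query/key vectors whose scale is increased by hand to mimic the output growth described above. All names (`l2_norm`, `layer_norm`, `attention_logit`, `base_q`, `base_k`) are illustrative. It shows that the raw attention logit grows with the magnitude of the linear layer outputs, while the logit computed after normalizing Q and K stays bounded.

```python
import math


def l2_norm(vec):
    """L2 norm of a vector, the quantity tracked for linear layer outputs."""
    return math.sqrt(sum(x * x for x in vec))


def layer_norm(vec, eps=1e-5):
    """Layer normalization without learned gain/bias (a simplifying assumption)."""
    mean = sum(vec) / len(vec)
    var = sum((x - mean) ** 2 for x in vec) / len(vec)
    return [(x - mean) / math.sqrt(var + eps) for x in vec]


def attention_logit(q, k, use_qk_layer_norm=False):
    """Scaled dot-product logit for one query/key pair."""
    if use_qk_layer_norm:
        q, k = layer_norm(q), layer_norm(k)
    return sum(a * b for a, b in zip(q, k)) / math.sqrt(len(q))


# Hypothetical query/key vectors; `scale` mimics growing linear layer outputs.
base_q = [0.5, -1.0, 2.0, 0.1]
base_k = [1.5, -0.5, 1.0, -0.2]

for scale in (1.0, 10.0, 100.0):
    q = [scale * x for x in base_q]
    k = [scale * x for x in base_k]
    raw = attention_logit(q, k)
    normed = attention_logit(q, k, use_qk_layer_norm=True)
    print(f"scale={scale:>6}: |q|={l2_norm(q):.2f} raw={raw:.2f} normed={normed:.4f}")
```

In this sketch the raw logit scales with the square of the output magnitude, whereas the normalized logit is bounded by the square root of the head dimension regardless of scale. Whether this carries over to a full model at high learning rates is what the paper's experiments address.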

Peer Reviews

Decision·ICLR 2025 Conference Withdrawn Submission

Reviewer 01 · Rating 3 · Confidence 3

Strengths

1) The finding that multiple layers are responsible for training instability, because of their output growth, is interesting.
2) They did a good job of highlighting the different training stability methods proposed in the literature.

Weaknesses

1) The quality of the figures is low; the authors should use vector images.
2) The results section needs more evidence from the different experiments they ran, rather than just the two tables shown. The authors could add training curves showing how the different methods help to improve stability.
3) They study instability at higher learning rates and make model training stable at those rates. There is no evidence that the model converges faster at different scales.

Reviewer 02 · Rating 3 · Confidence 4

Strengths

This paper proposes new methods of applying layer normalization and softmax capping across various layers in Transformer blocks to enhance training stability and improve perplexity in large language models.

Weaknesses

1. The paper’s format does not adhere to the ICLR25 standard.
2. The paper addresses the training stability of large language models (LLMs), stating that larger models tend to have decreased stability. However, the experimental results do not provide evidence that the method can be scaled to models with 10 billion or 100 billion parameters.
3. The experiments and results lack any comparison of training losses. Instead, the author only provides a table of divergence/convergence rates, which is in

Reviewer 03 · Rating 5 · Confidence 3

Strengths

- The paper does a good job of comparing and contrasting a range of methods to mitigate model divergence when training.
- The paper reports which methods worked, as well as which did not offer any benefit. Both of these results are incredibly valuable to analyzing model divergence as they offer further intuition for why certain methods do better.

Weaknesses

- The paper should report results on a range of small model sizes. At the moment, there are only results for an 830M parameter model, and it is not obvious how these findings would scale down.
- It would be helpful to provide further motivation and examples for why training instability might be a serious issue that requires mitigation.
- Many of the figures need labels on the axes.
- Figure 2: It would be good to have more examples of learning rates to show exactly how the model stops converging.

Reviewer 04 · Rating 3 · Confidence 3

Strengths

* Improvements seem significant in terms of perplexity
* Good number of comparisons

Weaknesses

* Given the availability of H100 GPUs to the authors, investigating the properties at fp8 would have been interesting too, as with lower precision comes greater instability. This would significantly raise the relevance of the publication.
* The datasets were not disclosed, making reproducibility virtually impossible. While this is sadly a common practice for many LLM papers, given the small parameter count of the model I feel reproducing the experiments would be more realistic, so I do not see

Taxonomy

Topics: Advanced Data Processing Techniques · Industrial Engineering and Technologies · Engineering Diagnostics and Reliability

Methods: Attention Is All You Need · Residual Connection · Position-Wise Feed-Forward Layer · Dense Connections · Multi-Head Attention · Adam · Dropout · Byte Pair Encoding · Absolute Position Encodings · Label Smoothing