Methods of improving LLM training stability
Oleg Rybakov, Mike Chrzanowski, Peter Dykas, Jinze Xue, Ben Lanir

TL;DR
This paper investigates training instabilities in large language models, focusing on the growth of linear layer outputs, and proposes normalization techniques to improve training stability and perplexity.
Contribution
It extends previous work by analyzing all linear layer outputs in Transformers and introduces normalization methods that enable higher learning rates and better model performance.
Findings
Applying layer normalization to additional layers raises the learning rate at which training remains stable.
Layer normalization after QK layers allows 1.5x higher learning rates without divergence.
Proposed methods significantly improve perplexity over baseline models.
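The idea behind layer normalization after the QK layers can be sketched in a few lines. This is a minimal illustration in plain Python, not the authors' implementation: the function names `layer_norm` and `attention_logit` and the example vectors are assumptions, and the normalization here has no learnable gain or bias. It shows that a raw attention logit grows with the scale of the query and key vectors, while the normalized logit stays bounded by the square root of the head dimension.

```python
import math

def layer_norm(x, eps=1e-6):
    """Normalize a vector to zero mean and unit variance (no learnable gain or bias)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def attention_logit(q, k, use_qk_norm):
    """Scaled dot-product logit for a single query/key pair."""
    if use_qk_norm:
        # Hypothetical QK layer norm: normalize query and key before the dot product.
        q, k = layer_norm(q), layer_norm(k)
    return sum(a * b for a, b in zip(q, k)) / math.sqrt(len(q))

# Illustrative query and key vectors (made up for this sketch).
q = [0.5, -1.0, 2.0, 0.1]
k = [1.5, 0.3, -0.7, 2.2]

for scale in (1.0, 10.0, 100.0):
    # Scaling q and k stands in for output growth during unstable training.
    qs = [scale * v for v in q]
    ks = [scale * v for v in k]
    raw = attention_logit(qs, ks, use_qk_norm=False)
    normed = attention_logit(qs, ks, use_qk_norm=True)
    print(f"scale={scale:6.1f}  raw={raw:12.3f}  qk_norm={normed:8.3f}")
```

Because each normalized vector has a squared L2 norm of at most d, the Cauchy-Schwarz inequality bounds the scaled logit by sqrt(d) in magnitude, whatever the scale of the inputs. A learnable gain, as used in most practical layer norm implementations, would relax this bound.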
Abstract
Training stability of large language models (LLMs) is an important research topic. Reproducing training instabilities can be costly, so we use a small language model with 830M parameters and experiment with higher learning rates to force models to diverge. One of the sources of training instability is the growth of logits in attention layers. We extend the focus of the previous work and look not only at the magnitude of the logits but at all outputs of linear layers in the Transformer block. We observe that with a high learning rate the L2 norm of all linear layer outputs can grow with each training step and the model diverges. Specifically, we observe that the QKV, Proj and FC2 layers have the largest growth of the output magnitude. This prompts us to explore several options: 1) apply layer normalization not only after QK layers but also after Proj and FC2 layers; 2) apply layer…
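The quantity the abstract tracks, the L2 norm of a linear layer's output, can be monitored with a sketch like the following. It is an illustration under stated assumptions rather than the paper's experimental setup: the weight matrix is random, the per-step `growth` factor is a hypothetical stand-in for weight growth under a too-high learning rate, and the layer norm has no learnable parameters. It records the output norm with and without a layer norm applied after the linear layer, as in option 1 for the Proj and FC2 layers.

```python
import math
import random

def linear(weights, x):
    """Apply a bias-free linear layer: y = W x."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]

def l2_norm(x):
    """Euclidean (L2) norm of a vector."""
    return math.sqrt(sum(v * v for v in x))

def layer_norm(x, eps=1e-6):
    """Normalize a vector to zero mean and unit variance (no learnable gain or bias)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

random.seed(0)
d = 8  # illustrative hidden size
x = [random.gauss(0, 1) for _ in range(d)]
w = [[random.gauss(0, 1 / math.sqrt(d)) for _ in range(d)] for _ in range(d)]

history = []
for step in range(5):
    # Hypothetical weight growth per step; not taken from the paper.
    growth = 1.5 ** step
    w_t = [[growth * v for v in row] for row in w]
    y = linear(w_t, x)
    history.append((l2_norm(y), l2_norm(layer_norm(y))))

for step, (raw, normed) in enumerate(history):
    print(f"step={step}  raw_norm={raw:9.3f}  after_layer_norm={normed:6.3f}")
```

The raw output norm grows with the weights, while the normalized output keeps a norm close to sqrt(d) at every step, which is the property that makes a layer norm after these layers a candidate remedy for output growth.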
Peer Reviews
Decision: ICLR 2025 Conference Withdrawn Submission
1) The finding that multiple layers are responsible for training instability because of output growth is interesting. 2) The paper clearly surveys the different training stability methods proposed in the literature.
1) The quality of the figures is low; the authors should use vector images. 2) The results section needs more evidence from the different experiments they ran, rather than just the two tables shown. The authors could add training curves showing how the different methods help to improve stability. 3) They study instability at higher learning rates and make training stable at those rates. There is no evidence of whether the model converges faster at different scales.
This paper proposes new methods of applying layer normalization and softmax capping across various layers in Transformer blocks to enhance training stability and improve perplexity in large language models.
1. The paper’s format does not adhere to the ICLR25 standard. 2. The paper addresses the training stability of large language models (LLMs), stating that larger models tend to have decreased stability. However, the experimental results do not provide evidence that the method can be scaled to models with 10 billion or 100 billion parameters. 3. The experiments and results lack any comparison of training losses. Instead, the author only provides a table of divergence/convergence rates, which is in
-The paper does a good job of comparing and contrasting a range of methods to mitigate model divergence when training. -The paper reports which methods worked, as well as which did not offer any benefit. Both of these results are incredibly valuable to analyzing model divergence as they offer further intuition for why certain methods do better.
-The paper should report results on a range of small model sizes. At the moment, there are only results for an 830M parameter model, and it is not obvious how these findings would scale down. -It would be helpful to provide further motivation and examples for why training instability might be a serious issue that requires mitigation. -Many of the figures need labels on the axes. -Figure 2: It would be good to have more examples of learning rates to show exactly how the model stops converging.
* Improvements seem significant in terms of perplexity * Good number of comparisons
* Given the availability of H100 GPUs to the authors, investigating the properties at fp8 would have been interesting too, as with lower precision comes greater instability. This would significantly raise the relevance of the publication. * The datasets were not disclosed, making reproducibility virtually impossible. While this is sadly a common practice for a lot of LLM papers, given the small parameter count of the model I feel reproducing the experiments would be more realistic, so I do not see
Taxonomy
Topics: Advanced Data Processing Techniques · Industrial Engineering and Technologies · Engineering Diagnostics and Reliability
Methods: Attention Is All You Need · Residual Connection · Position-Wise Feed-Forward Layer · Dense Connections · Multi-Head Attention · Adam · Dropout · Byte Pair Encoding · Absolute Position Encodings · Label Smoothing
