Initialization of Large Language Models via Reparameterization to Mitigate Loss Spikes
Kosuke Nishida, Kyosuke Nishida, Kuniko Saito

TL;DR
This paper introduces WeSaR, a reparameterization technique that stabilizes and accelerates large language model training by ensuring uniform parameter norms, effectively mitigating loss spikes during pre-training.
Contribution
We propose WeSaR, a novel weight scaling reparameterization method that stabilizes large language model training by controlling parameter norms, outperforming existing initialization techniques.
Findings
WeSaR stabilizes training across models from 130M to 13B parameters.
It accelerates training convergence compared to baseline methods.
It reduces loss spikes and improves training stability.
Abstract
Loss spikes, a phenomenon in which the loss value diverges suddenly, are a fundamental issue in the pre-training of large language models. This paper supposes that the non-uniformity of the norm of the parameters is one of the causes of loss spikes. In the training of neural networks, the scale of the gradients must be kept constant throughout the layers to avoid the vanishing and exploding gradients problem. However, to meet this requirement in the Transformer model, the norm of the model parameters must be non-uniform, and thus parameters whose norm is smaller are more sensitive to the parameter update. To address this issue, we propose a novel technique, weight scaling as reparameterization (WeSaR). WeSaR introduces a gate parameter per parameter matrix and adjusts it to the value satisfying the requirement. Because of the gate parameter, WeSaR sets the norm of the…
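The gate idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract is truncated here, so the scalar form of the gate, its initial value, the class name `GatedMatrix`, and all numeric values below are assumptions made for illustration only. The sketch stores each matrix at one common scale and lets a per-matrix gate carry the scale that the matrix needs.

```python
import math
import random


def init_matrix(rows, cols, std, rng):
    """Draw a rows x cols matrix with entries from N(0, std^2)."""
    return [[rng.gauss(0.0, std) for _ in range(cols)] for _ in range(rows)]


def frobenius_norm(matrix):
    """Frobenius norm of a list-of-lists matrix."""
    return math.sqrt(sum(x * x for row in matrix for x in row))


class GatedMatrix:
    """Hypothetical container: a stored matrix plus one scalar gate."""

    def __init__(self, rows, cols, target_std, common_std, rng):
        # Every stored matrix uses the same std, so stored norms are
        # roughly uniform for matrices of equal size.
        self.stored = init_matrix(rows, cols, common_std, rng)
        # Assumption: the gate starts at the ratio that restores the target scale.
        self.gate = target_std / common_std

    def effective(self):
        """Matrix actually used in the forward pass: gate * stored."""
        return [[self.gate * x for x in row] for row in self.stored]


rng = random.Random(0)
COMMON_STD = 0.02  # illustrative value, not taken from the paper

# Two same-sized matrices whose required scales differ by 10x (illustrative).
large = GatedMatrix(64, 64, target_std=0.02, common_std=COMMON_STD, rng=rng)
small = GatedMatrix(64, 64, target_std=0.002, common_std=COMMON_STD, rng=rng)

for name, m in (("large", large), ("small", small)):
    print(name, frobenius_norm(m.stored), frobenius_norm(m.effective()))
```

In this sketch the two stored matrices have nearly equal norms, while the effective matrices keep their different scales through the gate. Under these assumptions, an update of a given size perturbs both stored matrices by a similar relative amount.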
Taxonomy
Topics: Topic Modeling
Methods: Attention Is All You Need · Dense Connections · Adam · Linear Layer · Residual Connection · Position-Wise Feed-Forward Layer · Label Smoothing · Dropout · Byte Pair Encoding · Absolute Position Encodings
