Initialization of Large Language Models via Reparameterization to Mitigate Loss Spikes

Kosuke Nishida; Kyosuke Nishida; Kuniko Saito

arXiv:2410.05052·cs.CL·October 8, 2024

TL;DR

This paper introduces WeSaR, a reparameterization technique that stabilizes and accelerates large language model training by ensuring uniform parameter norms, effectively mitigating loss spikes during pre-training.

Contribution

We propose WeSaR, a novel weight scaling reparameterization method that stabilizes large language model training by controlling parameter norms, outperforming existing initialization techniques.

Findings

01. WeSaR stabilizes training across models from 130M to 13B parameters.

02. It accelerates training convergence compared to baseline methods.

03. It reduces loss spikes and improves training stability.

Abstract

Loss spikes, a phenomenon in which the loss value diverges suddenly, are a fundamental issue in the pre-training of large language models. This paper hypothesizes that non-uniformity of the parameter norms is one cause of loss spikes. When training neural networks, the scale of the gradients must be kept constant across layers to avoid the vanishing and exploding gradient problems. However, to meet this requirement in the Transformer model, the norms of the model parameters must be non-uniform, and thus parameters with smaller norms are more sensitive to parameter updates. To address this issue, we propose a novel technique, weight scaling as reparameterization (WeSaR). WeSaR introduces a gate parameter per parameter matrix and adjusts it to a value that satisfies the requirement. Because of the gate parameter, WeSaR sets the norm of the…
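The gate idea in the abstract can be illustrated with a small sketch. Because the abstract is truncated here, the details are assumptions rather than the authors' implementation: the sketch assumes the effective weight is a scalar gate times a stored matrix, that every stored matrix is drawn with the same standard deviation, and that the gate is initialized to the ratio between the standard deviation a layer needs and that shared value. The names `COMMON_STD`, `init_gated_matrix`, `effective_weight`, and `std` are hypothetical, as are the numeric values.

```python
import math
import random

# Hypothetical shared standard deviation for every stored matrix (assumption).
COMMON_STD = 0.01


def init_gated_matrix(rows, cols, target_std, common_std=COMMON_STD, seed=0):
    """Return (gate, stored) so that gate * stored has std close to target_std.

    The stored matrix is always drawn with common_std, so stored matrices
    share the same scale regardless of the layer's target_std.
    """
    rng = random.Random(seed)
    stored = [[rng.gauss(0.0, common_std) for _ in range(cols)]
              for _ in range(rows)]
    gate = target_std / common_std  # one scalar gate per parameter matrix
    return gate, stored


def effective_weight(gate, stored):
    """Weight actually used in the forward pass: gate * stored."""
    return [[gate * w for w in row] for row in stored]


def std(matrix):
    """Population standard deviation over all entries of a matrix."""
    values = [w for row in matrix for w in row]
    mean = sum(values) / len(values)
    return math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))


# Two layers that need different effective scales (illustrative values only).
gate_a, stored_a = init_gated_matrix(64, 64, target_std=0.125, seed=0)
gate_b, stored_b = init_gated_matrix(64, 64, target_std=0.02, seed=1)

print("stored stds:   ", std(stored_a), std(stored_b))
print("effective stds:", std(effective_weight(gate_a, stored_a)),
      std(effective_weight(gate_b, stored_b)))
```

In this sketch the two stored matrices have nearly the same scale even though their effective weights differ, which is the uniformity the TL;DR refers to. In a real training setup the gate would presumably be a trainable parameter alongside the stored matrix; that part is not shown.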

Taxonomy

Topics: Topic Modeling

Methods: Attention Is All You Need · Dense Connections · Adam · Linear Layer · Residual Connection · Position-Wise Feed-Forward Layer · Label Smoothing · Dropout · Byte Pair Encoding · Absolute Position Encodings