Stable Language Model Pre-training by Reducing Embedding Variability
Woojin Chung, Jiwoo Hong, Na Min An, James Thorne, Se-Young Yun

TL;DR
This paper introduces Token Embedding Variability as an efficient proxy for pre-training stability and proposes Multi-head Low-Rank Attention to mitigate gradient explosion, leading to more stable and better-performing language models.
Contribution
The paper presents Token Embedding Variability (TEV) as a novel proxy for assessing pre-training stability and Multi-head Low-Rank Attention (MLRA) as a new architecture for improving that stability in language models.
Findings
MLRA limits the growth of output embedding variance and thereby mitigates gradient explosion.
GPT-2 models with MLRA show increased stability and lower perplexity.
The gains are most pronounced in deeper models.
Abstract
Stable pre-training is essential for achieving better-performing language models. However, tracking pre-training stability by calculating gradient variance at every step is impractical due to the significant computational costs. We explore Token Embedding Variability (TEV) as a simple and efficient proxy for assessing pre-training stability in language models with pre-layer normalization, given that shallower layers are more prone to gradient explosion (section 2.2). Moreover, we propose Multi-head Low-Rank Attention (MLRA) as an architecture to alleviate such instability by limiting the exponential growth of output embedding variance, thereby preventing the gradient explosion (section 3.2). Empirical results on GPT-2 with MLRA demonstrate increased stability and lower perplexity, particularly in deeper models.
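The abstract does not spell out how TEV is computed. As a minimal sketch, the code below assumes one plausible reading: take the standard deviation of each token's embedding vector across its dimensions, then summarize those per-token values by their mean and standard deviation over the vocabulary. The function name token_embedding_variability and the toy matrix are hypothetical and chosen for illustration only; the paper's exact definition may differ.

```python
from statistics import fmean, pstdev


def token_embedding_variability(embeddings):
    """Return (mean, std) over tokens of each token's per-dimension std.

    `embeddings` is a list of rows, one row per vocabulary token.
    This definition is an assumption made for illustration.
    """
    # One value per token: population std across that token's dimensions.
    per_token_std = [pstdev(row) for row in embeddings]
    # Summarize the per-token values over the whole vocabulary.
    return fmean(per_token_std), pstdev(per_token_std)


# Toy embedding matrix: 3 tokens, 4 dimensions (illustrative values only).
toy_embeddings = [
    [1.0, 1.0, 1.0, 1.0],    # constant vector, per-dimension std is 0
    [0.0, 2.0, 0.0, 2.0],    # per-dimension std is 1
    [-2.0, 2.0, -2.0, 2.0],  # per-dimension std is 2
]

tev_mean, tev_std = token_embedding_variability(toy_embeddings)
print(f"TEV mean: {tev_mean:.4f}, TEV std: {tev_std:.4f}")
```

A statistic of this kind needs only the embedding matrix itself, which is what would make it cheap to log during training compared with computing gradient variance at every step.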
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Speech Recognition and Synthesis
Methods: Linear Layer · Multi-Head Attention · Cosine Annealing · Byte Pair Encoding · Softmax · Dropout · Layer Normalization · Attention Is All You Need · Adam
