Stable Language Model Pre-training by Reducing Embedding Variability

Woojin Chung; Jiwoo Hong; Na Min An; James Thorne; Se-Young Yun

arXiv:2409.07787 · cs.CL · September 13, 2024

TL;DR

This paper introduces Token Embedding Variability as an efficient proxy for pre-training stability and proposes Multi-head Low-Rank Attention to mitigate gradient explosion, leading to more stable and better-performing language models.

Contribution

It presents a novel proxy for stability assessment and a new architecture to improve pre-training stability in language models.

Findings

01 MLRA limits the growth of output embedding variance, which mitigates gradient explosion.

02 GPT-2 models with MLRA train more stably and reach lower perplexity.

03 The gains are most pronounced in deeper language models.

Abstract

Stable pre-training is essential for achieving better-performing language models. However, tracking pre-training stability by calculating gradient variance at every step is impractical because of its significant computational cost. We explore Token Embedding Variability (TEV) as a simple and efficient proxy for assessing pre-training stability in language models with pre-layer normalization, given that shallower layers are more prone to gradient explosion (section 2.2). Moreover, we propose Multi-head Low-Rank Attention (MLRA) as an architecture that alleviates such instability by limiting the exponential growth of output embedding variance, thereby preventing gradient explosion (section 3.2). Empirical results on GPT-2 with MLRA demonstrate increased stability and lower perplexity, particularly in deeper models.
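
The abstract does not spell out how TEV is computed, so the following is only a minimal sketch under an assumption: that variability is measured as the standard deviation of each token's embedding vector across its dimensions, then summarized by the mean and spread over the vocabulary. The function name `token_embedding_variability` and the toy embedding matrix are hypothetical and not taken from the paper. The sketch is meant to show why such a statistic is cheap to track: it needs only the embedding weights, not per-step gradients.

```python
import numpy as np


def token_embedding_variability(embedding: np.ndarray) -> tuple[float, float]:
    """Summarize per-token embedding spread (assumed definition, see lead-in).

    embedding: array of shape (vocab_size, d_model).
    Returns (mean, std) over the vocabulary of each token's standard
    deviation across the d_model dimensions.
    """
    if embedding.ndim != 2:
        raise ValueError("expected a 2-D (vocab_size, d_model) matrix")
    # Standard deviation of each token's vector across embedding dimensions.
    per_token_std = embedding.std(axis=1)
    # Summaries over the vocabulary.
    return float(per_token_std.mean()), float(per_token_std.std())


# Toy usage: a randomly initialized embedding table (hypothetical sizes).
rng = np.random.default_rng(0)
toy_embedding = rng.normal(loc=0.0, scale=0.02, size=(1000, 64))
tev_mean, tev_std = token_embedding_variability(toy_embedding)
print(f"TEV mean={tev_mean:.4f}, TEV std={tev_std:.4f}")
```

In a training loop, a statistic like this could be logged at checkpoints from the embedding weights alone, which is far cheaper than estimating gradient variance at every step.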

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Speech Recognition and Synthesis

Methods: Linear Layer · Multi-Head Attention · Cosine Annealing · Byte Pair Encoding · Softmax · Dropout · Layer Normalization · Attention Is All You Need · Adam