The Sharpness Disparity Principle in Transformers for Accelerating Language Model Pre-Training

Jinbo Wang; Mingze Wang; Zhanpeng Zhou; Junchi Yan; Weinan E; Lei Wu

arXiv:2502.19002·cs.LG·June 16, 2025

Open Access

TL;DR

This paper identifies a sharpness disparity among transformer blocks during training and introduces a blockwise learning rate strategy that accelerates language model pre-training by nearly two times while reducing memory usage.

Contribution

It uncovers the sharpness disparity in transformer blocks and proposes a blockwise learning rate method that improves training speed and efficiency for large language models.

Findings

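01 Achieves nearly 2x speedup in LLM pre-training compared to vanilla AdamW.

02 Reduces memory usage by 2x when Blockwise LR is combined with the memory-efficient Adam-mini optimizer.

03 Demonstrates effectiveness across GPT-2 and LLaMA models (0.12B to 2B parameters) on OpenWebText, MiniPile, and C4.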

Abstract

Transformers consist of diverse building blocks, such as embedding layers, normalization layers, self-attention mechanisms, and point-wise feedforward networks. Thus, understanding the differences and interactions among these blocks is important. In this paper, we uncover a clear Sharpness Disparity across these blocks, which emerges early in training and intriguingly persists throughout the training process. Motivated by this finding, we propose Blockwise Learning Rate (LR), a strategy that tailors the LR to each block's sharpness, accelerating large language model (LLM) pre-training. By integrating Blockwise LR into AdamW, we consistently achieve lower terminal loss and nearly $2 \times$ speedup compared to vanilla AdamW. We demonstrate this acceleration across GPT-2 and LLaMA, with model sizes ranging from 0.12B to 2B and datasets of OpenWebText, MiniPile, and C4. Finally, we…
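To make the Blockwise LR idea concrete, here is a minimal sketch of assigning a separate learning rate to each block type. It is not the authors' implementation: the helper names (`classify_block`, `build_param_groups`), the name-matching rules, and the multiplier values are all hypothetical placeholders. It also assumes, without taking this from the paper, that flatter blocks can tolerate a larger LR than sharper ones.

```python
from collections import defaultdict

# Hypothetical LR multipliers per block type. These values are placeholders
# for illustration only; they are NOT taken from the paper.
BLOCK_LR_MULTIPLIER = {
    "embedding": 2.0,
    "attention": 2.0,
    "ffn": 2.0,
    "norm": 1.0,  # assumed here to be the block that keeps the base LR
}


def classify_block(param_name: str) -> str:
    """Map a parameter name to a block type.

    The matching rules assume GPT-2-style names (e.g. "h.0.attn.c_attn.weight");
    other codebases will need different rules.
    """
    tokens = param_name.lower().split(".")
    if any(t in ("wte", "wpe") or "embed" in t for t in tokens):
        return "embedding"
    if any(t.startswith("ln") or "norm" in t for t in tokens):
        return "norm"
    if any(t in ("attn", "attention") for t in tokens):
        return "attention"
    if any(t in ("mlp", "ffn") for t in tokens):
        return "ffn"
    return "other"


def build_param_groups(param_names, base_lr):
    """Group parameter names by block type and attach a per-block LR."""
    grouped = defaultdict(list)
    for name in param_names:
        grouped[classify_block(name)].append(name)
    return [
        {
            "block": block,
            "params": names,
            # Unknown block types fall back to the base LR.
            "lr": base_lr * BLOCK_LR_MULTIPLIER.get(block, 1.0),
        }
        for block, names in grouped.items()
    ]


if __name__ == "__main__":
    names = [
        "wte.weight",
        "h.0.ln_1.weight",
        "h.0.attn.c_attn.weight",
        "h.0.mlp.c_fc.weight",
        "ln_f.weight",
    ]
    for group in build_param_groups(names, base_lr=1e-3):
        print(group["block"], group["lr"], group["params"])
```

In PyTorch, a list of dictionaries with `params` and `lr` keys is the per-parameter-group format accepted by `torch.optim.AdamW`, so the same grouping could be applied with parameter tensors in place of names.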

Taxonomy

Topics: Topic Modeling · Artificial Intelligence in Healthcare and Education · Domain Adaptation and Few-Shot Learning

Methods: Linear Layer · Discriminative Fine-Tuning · Attention Is All You Need · Multi-Head Attention · Adam · Softmax · Dropout · Weight Decay · Cosine Annealing