One LR Doesn't Fit All: Heavy-Tail Guided Layerwise Learning Rates for LLMs

Di He; Songjun Tu; Keyu Wang; Lu Yin; Shiwei Liu

arXiv:2605.22297·cs.LG·May 22, 2026

TL;DR

This paper proposes a layerwise learning rate scheme for LLMs based on heavy-tailed spectral analysis, leading to faster training and better generalization with minimal tuning.

Contribution

The paper introduces Heavy-Tail Guided Layerwise Learning Rates (LLR), an adaptive scheme that assigns each Transformer layer its own learning rate according to how heavy-tailed the spectrum of its weights is.

Findings

01 Achieves up to 1.5x training speedup.

02 Improves zero-shot accuracy from 47.09% to 49.02%.

03 Transfers nearly optimal learning rates from baseline.

Abstract

Learning rate configuration is a fundamental aspect of modern deep learning. The prevailing practice of applying a uniform learning rate across all layers overlooks the structural heterogeneity of Transformers, potentially limiting their effectiveness as the backbone of Large Language Models (LLMs). In this paper, we introduce Layerwise Learning Rate (LLR), an adaptive scheme that assigns distinct learning rates to individual Transformer layers. Our method is grounded in Heavy-Tailed Self-Regularization (HT-SR) theory, which characterizes the empirical spectral density (ESD) of weight correlation matrices to quantify heavy-tailedness. Layers with weaker heavy-tailedness are assigned larger learning rates to accelerate their training, while layers with stronger heavy-tailedness receive smaller learning rates. By tailoring learning rates in this manner, LLR promotes balanced training…
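The rule described in the abstract (weaker heavy-tailedness gets a larger learning rate, stronger heavy-tailedness gets a smaller one) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes heavy-tailedness is measured by a power-law exponent alpha fitted to the tail of the ESD with a Hill estimator, where a smaller alpha means a heavier tail, and it assumes a simple linear map from alpha to a learning-rate multiplier. The function names `hill_alpha` and `layerwise_lrs` and the parameters `tail_frac`, `s_min`, and `s_max` are hypothetical and do not come from the paper.

```python
import numpy as np


def hill_alpha(weight, tail_frac=0.5):
    """Hill estimate of the power-law exponent of a layer's ESD tail.

    Assumption: the ESD is taken over the eigenvalues of W^T W, which are
    the squared singular values of W. Smaller alpha = heavier tail.
    """
    eigs = np.sort(np.linalg.svd(weight, compute_uv=False) ** 2)  # ascending
    n = len(eigs)
    if n < 2:
        raise ValueError("need at least two eigenvalues")
    # Number of largest eigenvalues treated as the tail (hypothetical choice).
    k = min(n - 1, max(1, int(n * tail_frac)))
    threshold = eigs[n - k - 1]          # largest eigenvalue outside the tail
    tail = eigs[n - k:]
    log_sum = np.sum(np.log(tail / threshold))
    return 1.0 + k / max(log_sum, 1e-12)  # guard against a degenerate tail


def layerwise_lrs(alphas, base_lr, s_min=0.5, s_max=1.5):
    """Map per-layer alphas to learning rates.

    Assumption: a linear map that gives the layer with the largest alpha
    (weakest heavy tail) base_lr * s_max and the layer with the smallest
    alpha (strongest heavy tail) base_lr * s_min.
    """
    alphas = np.asarray(alphas, dtype=float)
    span = alphas.max() - alphas.min()
    if span == 0:
        return np.full_like(alphas, base_lr)
    t = (alphas - alphas.min()) / span
    return base_lr * (s_min + t * (s_max - s_min))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Stand-in weight matrices; a real run would use the model's layers.
    layers = [rng.standard_normal((64, 32)) for _ in range(4)]
    alphas = [hill_alpha(w) for w in layers]
    print(layerwise_lrs(alphas, base_lr=1e-3))
```

In a training loop, the resulting per-layer values would typically be passed to the optimizer as separate parameter groups and refreshed periodically; how often to refresh and which estimator to use are design choices this sketch leaves open.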

Code & Models

Repositories

hed-ucas/Layer-wise-Learning-Rate (GitHub)
