TL;DR
This paper proposes a layerwise learning rate scheme for LLMs based on heavy-tailed spectral analysis, leading to faster training and better generalization with minimal tuning.
Contribution
The paper introduces Heavy-Tail Guided Layerwise Learning Rates (LLR), an adaptive method that assigns each layer its own learning rate based on the spectral properties of that layer's weight matrices.
Findings
Achieves up to a 1.5x training speedup.
Raises zero-shot accuracy from 47.09% to 49.02%.
Near-optimal learning rates transfer directly from the uniform-learning-rate baseline, so little additional tuning is needed.
Abstract
Learning rate configuration is a fundamental aspect of modern deep learning. The prevailing practice of applying a uniform learning rate across all layers overlooks the structural heterogeneity of Transformers, potentially limiting their effectiveness as the backbone of Large Language Models (LLMs). In this paper, we introduce Layerwise Learning Rate (LLR), an adaptive scheme that assigns distinct learning rates to individual Transformer layers. Our method is grounded in Heavy-Tailed Self-Regularization (HT-SR) theory, which characterizes the empirical spectral density (ESD) of weight correlation matrices to quantify heavy-tailedness. Layers with weaker heavy-tailedness are assigned larger learning rates to accelerate their training, while layers with stronger heavy-tailedness receive smaller learning rates. By tailoring learning rates in this manner, LLR promotes balanced training…
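The assignment rule described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the Hill estimator used to measure heavy-tailedness, the linear mapping from that measure to a learning rate, and the names `esd`, `hill_alpha`, `layerwise_lrs`, `k_frac`, `s_min`, and `s_max` are all assumptions made for illustration. The sketch only shows the general idea that a layer whose ESD has a lighter tail (larger estimated exponent) receives a larger learning rate.

```python
import numpy as np


def esd(weight):
    """Eigenvalues of the correlation matrix W^T W (squared singular values of W)."""
    return np.linalg.svd(np.asarray(weight, dtype=float), compute_uv=False) ** 2


def hill_alpha(eigs, k_frac=0.5):
    """Hill estimate of the power-law exponent of the ESD tail.

    A smaller alpha means a heavier tail. k_frac (the fraction of eigenvalues
    treated as the tail) is a hypothetical knob, not a value from the paper.
    """
    eigs = np.sort(np.asarray(eigs, dtype=float))  # ascending order
    n = len(eigs)
    k = max(1, min(n - 1, int(n * k_frac)))        # tail size, keeping one value as threshold
    threshold = eigs[n - k - 1]                    # largest eigenvalue below the tail
    tail = eigs[n - k:]
    return 1.0 + k / np.sum(np.log(tail / threshold))


def layerwise_lrs(alphas, base_lr, s_min=0.5, s_max=1.5):
    """Map per-layer alphas linearly onto [s_min, s_max] * base_lr.

    Assumed mapping: the lightest-tailed layer (largest alpha) gets the
    largest rate, the heaviest-tailed layer gets the smallest.
    """
    alphas = np.asarray(alphas, dtype=float)
    span = alphas.max() - alphas.min()
    if span == 0:                                  # identical layers: keep the uniform rate
        return np.full_like(alphas, base_lr)
    scale = s_min + (alphas - alphas.min()) / span * (s_max - s_min)
    return base_lr * scale


# Toy demo on two synthetic "layers": Gaussian entries vs. heavy-tailed Student-t entries.
rng = np.random.default_rng(0)
layers = [rng.standard_normal((64, 32)), rng.standard_t(df=2, size=(64, 32))]
alphas = [hill_alpha(esd(w)) for w in layers]
print(layerwise_lrs(alphas, base_lr=1e-3))
```

In a real training loop the resulting rates would presumably be recomputed periodically and handed to the optimizer as per-layer learning rates; how often, and with what scaling range, is not specified in the excerpt above.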
