AlphaDecay: Module-wise Weight Decay for Heavy-Tailed Balancing in LLMs
Di He, Songjun Tu, Ajay Jaiswal, Li Shen, Ganzhao Yuan, Shiwei Liu, Lu Yin

TL;DR
AlphaDecay introduces a module-wise adaptive weight decay method for LLMs, guided by spectral analysis, to improve training stability and performance over uniform decay approaches.
Contribution
It proposes a novel adaptive weight decay technique based on spectral properties, enhancing LLM training by balancing module-specific regularization.
Findings
AlphaDecay outperforms uniform decay in perplexity and generalization.
The method is effective across models from 60M to 1B parameters.
Spectral analysis guides optimal decay strength assignment.
Abstract
Weight decay is a standard regularization technique for training large language models (LLMs). While it is common to assign a uniform decay rate to every layer, this approach overlooks the structural diversity of LLMs and the varying spectral properties across modules. In this paper, we introduce AlphaDecay, a simple yet effective method that adaptively assigns different weight decay strengths to each module of an LLM. Our approach is guided by Heavy-Tailed Self-Regularization (HT-SR) theory, which analyzes the empirical spectral density (ESD) of weight correlation matrices to quantify "heavy-tailedness." Modules exhibiting more pronounced heavy-tailed ESDs, reflecting stronger feature learning, are assigned weaker decay, while modules with lighter-tailed spectra receive stronger decay. Our method leverages tailored weight decay assignments to balance the module-wise differences in…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsMining Techniques and Economics
