AdamHD: Decoupled Huber Decay Regularization for Language Model Pre-Training
Fu-Ming Guo, Yingfang Fan

TL;DR
AdamHD introduces a decoupled Huber regularizer for optimizer updates in language model pre-training, improving convergence speed, sparsity, and robustness over traditional weight decay methods like AdamW.
Contribution
It proposes AdamHuberDecay, a novel optimizer that replaces the $\ell_2$ penalty with a smooth Huber regularizer, enhancing training efficiency and model sparsity.
Findings
Converges 10-15% faster than AdamW in wall-clock time.
Reduces validation perplexity by up to 4 points.
Achieves 2.5-4.7% performance improvements on downstream tasks.
Abstract
Adaptive optimizers with decoupled weight decay, such as AdamW, are the de facto standard for pre-training large transformer-based generative models. Yet the quadratic nature of the $\ell_2$ penalty embedded in weight decay drives all parameters toward the origin at the same rate, making the update vulnerable to rare but extreme gradient directions and often over-penalizing well-conditioned coordinates. We propose AdamHuberDecay, a drop-in replacement for AdamW that substitutes the $\ell_2$ penalty with a decoupled smooth Huber regularizer. The resulting update decays parameters quadratically while their magnitude remains below a threshold $\delta$, and linearly ($\ell_1$-like) once they exceed $\delta$, yielding (i) bounded regularization gradients, (ii) invariance to per-coordinate second-moment rescaling, and (iii) stronger sparsity pressure on overgrown weights. We derive the…
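The update described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes the standard Huber penalty, whose gradient equals the weight itself inside the threshold and the threshold times the weight's sign outside it, and it applies that gradient in the same decoupled position where AdamW applies its weight decay term. The names `huber_decay_grad`, `adam_huber_decay_step`, `weight_decay`, and `delta` are hypothetical, and the default hyperparameter values are placeholders.

```python
import math

def huber_decay_grad(w, delta):
    """Gradient of the standard Huber penalty (assumed form).

    Equals w when |w| <= delta (quadratic region) and delta * sign(w)
    otherwise (linear region), so its magnitude never exceeds delta.
    """
    return max(-delta, min(delta, w))

def adam_huber_decay_step(params, grads, m, v, t, lr=1e-3, beta1=0.9,
                          beta2=0.999, eps=1e-8, weight_decay=0.01, delta=1.0):
    """One Adam step with a decoupled Huber decay term (illustrative sketch).

    params, grads, m, v are equal-length lists of floats; t is the 1-based
    step count. Returns updated (params, m, v) as new lists.
    """
    new_params, new_m, new_v = [], [], []
    for w, g, m_i, v_i in zip(params, grads, m, v):
        # Standard Adam moment estimates with bias correction.
        m_i = beta1 * m_i + (1.0 - beta1) * g
        v_i = beta2 * v_i + (1.0 - beta2) * g * g
        m_hat = m_i / (1.0 - beta1 ** t)
        v_hat = v_i / (1.0 - beta2 ** t)
        adam_update = m_hat / (math.sqrt(v_hat) + eps)
        # Decoupled decay: kept outside the adaptive rescaling, as in AdamW,
        # but using the bounded Huber gradient instead of w itself.
        decay = weight_decay * huber_decay_grad(w, delta)
        new_params.append(w - lr * (adam_update + decay))
        new_m.append(m_i)
        new_v.append(v_i)
    return new_params, new_m, new_v
```

With zero loss gradients, a weight inside the threshold shrinks in proportion to its size, while a weight beyond the threshold shrinks by a fixed amount per step, which is the bounded regularization gradient the abstract lists as property (i).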
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Machine Learning in Materials Science · Topic Modeling
