AdamHD: Decoupled Huber Decay Regularization for Language Model Pre-Training

Fu-Ming Guo; Yingfang Fan

arXiv:2511.14721·cs.LG·November 19, 2025

TL;DR

AdamHD introduces a decoupled Huber regularizer for optimizer updates in language model pre-training, improving convergence speed, sparsity, and robustness over traditional weight decay methods like AdamW.

Contribution

It proposes AdamHuberDecay, a novel optimizer that replaces the $ℓ_{2}$ penalty with a smooth Huber regularizer, enhancing training efficiency and model sparsity.

Findings

01. Converges 10-15% faster in wall-clock time.

02. Reduces validation perplexity by up to 4 points.

03. Achieves 2.5-4.7% performance improvements on downstream tasks.

Abstract

Adaptive optimizers with decoupled weight decay, such as AdamW, are the de facto standard for pre-training large transformer-based generative models. Yet the quadratic nature of the $ℓ_{2}$ penalty embedded in weight decay drives all parameters toward the origin at the same rate, making the update vulnerable to rare but extreme gradient directions and often over-penalizing well-conditioned coordinates. We propose AdamHuberDecay, a drop-in replacement for AdamW that substitutes the $ℓ_{2}$ penalty with a decoupled smooth Huber regularizer. The resulting update decays parameters quadratically while their magnitude remains below a threshold $δ$, and linearly ($ℓ_{1}$-like) once they exceed $δ$, yielding (i) bounded regularization gradients, (ii) invariance to per-coordinate second-moment rescaling, and (iii) stronger sparsity pressure on overgrown weights. We derive the…
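The decay rule described in the abstract can be illustrated with a minimal sketch. The abstract is truncated before the update is stated, so this is not the paper's exact algorithm. It assumes the standard Huber penalty, whose gradient is the parameter itself when its magnitude is at most $δ$ and $δ \cdot \mathrm{sign}(θ)$ otherwise. It also assumes the decay term is applied outside the Adam preconditioner, in the same way AdamW applies its decay. The function names `huber_decay_grad` and `adam_huber_decay_step`, the hyperparameter defaults, and the choice to scale the decay by the same learning rate are all hypothetical.

```python
import math


def huber_decay_grad(theta, delta):
    """Gradient of the Huber penalty for one scalar parameter.

    Equals theta when |theta| <= delta (quadratic region) and
    delta * sign(theta) otherwise (linear region), so it is bounded by delta.
    """
    return max(-delta, min(delta, theta))


def adam_huber_decay_step(params, grads, m, v, t, lr=1e-3, beta1=0.9,
                          beta2=0.999, eps=1e-8, weight_decay=0.01, delta=1.0):
    """One Adam step with a decoupled Huber decay term (illustrative only).

    params, grads, m, v are equal-length lists of floats; t is the 1-based
    step count. Returns updated (params, m, v) without mutating the inputs.
    """
    new_params, new_m, new_v = [], [], []
    for p, g, m_i, v_i in zip(params, grads, m, v):
        # Standard Adam moment estimates with bias correction.
        m_i = beta1 * m_i + (1.0 - beta1) * g
        v_i = beta2 * v_i + (1.0 - beta2) * g * g
        m_hat = m_i / (1.0 - beta1 ** t)
        v_hat = v_i / (1.0 - beta2 ** t)
        adam_update = m_hat / (math.sqrt(v_hat) + eps)

        # Assumption: decay is decoupled, i.e. NOT divided by sqrt(v_hat),
        # and uses the clipped (Huber) gradient instead of p itself.
        decay = weight_decay * huber_decay_grad(p, delta)

        new_params.append(p - lr * adam_update - lr * decay)
        new_m.append(m_i)
        new_v.append(v_i)
    return new_params, new_m, new_v


# Usage: with zero gradients only the decay term acts on the parameters.
params, m, v = [0.5, 10.0], [0.0, 0.0], [0.0, 0.0]
stepped, m, v = adam_huber_decay_step(
    params, [0.0, 0.0], m, v, t=1, lr=0.1, weight_decay=0.1, delta=1.0
)
```

In this sketch the small weight (0.5) is shrunk in proportion to its value, exactly as under AdamW, while the large weight (10.0) receives a decay step capped at `lr * weight_decay * delta`. That cap is the bounded regularization gradient from point (i) of the abstract.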

Taxonomy

Topics: Stochastic Gradient Optimization Techniques · Machine Learning in Materials Science · Topic Modeling