GradientStabilizer: Fix the Norm, Not the Gradient
Tianjin Huang, Zhangyang Wang, Haotian Hu, Zhenyu Zhang, Gaojie Jin, Xiang Li, Li Shen, Jiaxing Shang, Tianlong Chen, Ke Li, Lu Liu, Qingsong Wen, Shiwei Liu

TL;DR
GradientStabilizer is a novel gradient transformation technique that stabilizes training by bounding gradient norms, improving robustness and reducing divergence across various deep learning tasks without the need for threshold tuning.
Contribution
It introduces a statistically stabilized gradient magnitude estimate that preserves direction while bounding size, enhancing training stability over traditional clipping methods.
Findings
Consistently improves training stability across multiple tasks.
Widens stable learning-rate regions.
Reduces divergence and sensitivity to weight decay.
Abstract
Training instability in modern deep learning systems is frequently triggered by rare but extreme gradient-norm spikes, which can induce oversized parameter updates, corrupt optimizer state, and lead to slow recovery or divergence. Widely used safeguards such as gradient clipping mitigate these failures but require threshold tuning and indiscriminately truncate large updates. We propose GradientStabilizer, a lightweight, drop-in gradient transform that preserves the instantaneous gradient direction while replacing the update magnitude with a statistically stabilized estimate derived from running gradient-norm statistics. We prove that the resulting stabilized magnitude is uniformly bounded on spike steps, independent of the spike size, and show how this boundedness controls optimizer state evolution in adaptive methods. Across LLM pre-training (FP16), quantization-aware pre-training…
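The abstract does not state the update rule, so the following is only a minimal sketch of the general idea (keep the gradient direction, replace its magnitude using running gradient-norm statistics), not the paper's method. The class name `NormStabilizer`, the helper `l2_norm`, the parameters `beta` and `eps`, and the specific choice of dividing by a bias-corrected running root-mean-square of the norm are all assumptions made for illustration.

```python
import math


def l2_norm(vec):
    """Euclidean norm of a flat list of floats."""
    return math.sqrt(sum(x * x for x in vec))


class NormStabilizer:
    """Hypothetical direction-preserving gradient transform.

    Assumption (not from the paper): the new magnitude is
    ||g|| / sqrt(v_hat), where v_hat is a bias-corrected exponential
    moving average of the squared gradient norm that already includes
    the current step.
    """

    def __init__(self, beta=0.99, eps=1e-12):
        self.beta = beta  # decay rate of the running statistic
        self.eps = eps    # guards against division by zero
        self.v = 0.0      # running average of squared norms
        self.t = 0        # step counter for bias correction

    def __call__(self, grad):
        norm = l2_norm(grad)
        self.t += 1
        # Update the running second moment of the gradient norm.
        self.v = self.beta * self.v + (1.0 - self.beta) * norm * norm
        v_hat = self.v / (1.0 - self.beta ** self.t)
        # Every component is scaled by the same positive factor,
        # so the direction of the gradient is unchanged.
        scale = 1.0 / (math.sqrt(v_hat) + self.eps)
        return [x * scale for x in grad]


if __name__ == "__main__":
    stabilizer = NormStabilizer(beta=0.99)
    for _ in range(100):
        steady = stabilizer([0.6, 0.8])   # ordinary steps, norm 1
    spike = stabilizer([6e5, 8e5])        # one spike step, norm 1e6
    print(l2_norm(steady), l2_norm(spike))
```

Under this assumed rule, the current squared norm enters `v_hat` with weight at least `1 - beta`, so the rescaled norm can never exceed `1 / sqrt(1 - beta)` however large the spike is. That mirrors the kind of spike-independent bound the abstract describes, and no clipping threshold is involved, although `beta` remains a hyperparameter of this sketch.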
Taxonomy
Topics: Advanced Materials and Mechanics
Methods: Adam · Gradient Normalization
