GradPower: Powering Gradients for Faster Language Model Pre-Training
Jinbo Wang, Mingze Wang, Jiaqi Zhang, Wei Wang, Peng Pei, Xunliang Cai, Weinan E, Lei Wu

TL;DR
GradPower is a lightweight gradient transformation that accelerates language model pre-training: it applies an elementwise sign-power transformation to the gradient before passing it to the base optimizer, and it reaches lower terminal loss across a range of models and datasets.
Contribution
Introducing GradPower, a lightweight gradient transformation method that enhances optimizer performance with minimal code changes and broad applicability.
Findings
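Achieves lower terminal loss across architectures (LLaMA, Qwen2MoE), parameter scales (66M to 2B), datasets (C4, OpenWebText), and learning-rate schedules (cosine, warmup-stable-decay).
The largest gains are observed when training modern mixture-of-experts models with warmup-stable-decay schedules.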
Seamlessly integrates with optimizers like Adam and Muon, yielding further improvements.
Abstract
We propose GradPower, a lightweight gradient-transformation technique for accelerating language model pre-training. Given a gradient vector $g = (g_i)_i$, GradPower first applies the elementwise sign-power transformation $\varphi_p(g) = (\operatorname{sign}(g_i)\,|g_i|^p)_i$ for a fixed $p > 0$, and then feeds the transformed gradient into a base optimizer. Notably, GradPower requires only a single-line code change and no modifications to the base optimizer's internal logic, including the hyperparameters. When applied to Adam (termed AdamPower), GradPower consistently achieves lower terminal loss across diverse architectures (LLaMA, Qwen2MoE), parameter scales (66M to 2B), datasets (C4, OpenWebText), and learning-rate schedules (cosine, warmup-stable-decay). The most pronounced gains are observed when training modern mixture-of-experts models with warmup-stable-decay schedules. GradPower also…
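As a rough illustration of the "single-line code change" described in the abstract, the sketch below applies an elementwise sign-power transform (each entry g_i becomes sign(g_i) * |g_i|^p) to a gradient and then hands the result to an unmodified Adam step. It is not the authors' implementation: the names `grad_power`, `adam_step`, `new_state`, and `P` are hypothetical, the Adam update is the standard textbook form written on plain Python lists, and the exponent value is chosen only for illustration.

```python
import math

def grad_power(grad, p):
    """Elementwise sign-power transform: sign(g_i) * |g_i| ** p, for p > 0."""
    return [math.copysign(abs(g) ** p, g) for g in grad]

def new_state(n):
    """Fresh Adam state for n parameters (hypothetical helper)."""
    return {"t": 0, "m": [0.0] * n, "v": [0.0] * n}

def adam_step(params, grad, state, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One textbook Adam update; stands in for the unmodified base optimizer."""
    state["t"] += 1
    t = state["t"]
    updated = []
    for i, (w, g) in enumerate(zip(params, grad)):
        state["m"][i] = beta1 * state["m"][i] + (1 - beta1) * g
        state["v"][i] = beta2 * state["v"][i] + (1 - beta2) * g * g
        m_hat = state["m"][i] / (1 - beta1 ** t)  # bias-corrected first moment
        v_hat = state["v"][i] / (1 - beta2 ** t)  # bias-corrected second moment
        updated.append(w - lr * m_hat / (math.sqrt(v_hat) + eps))
    return updated

# Toy objective f(w) = 0.5 * sum(w_i ** 2), whose gradient is w itself.
P = 1.2  # illustrative exponent only (an assumption, not a recommended setting)
params = [1.0, -2.0, 0.5]
state = new_state(len(params))
for _ in range(200):
    grad = list(params)            # gradient of the toy objective
    grad = grad_power(grad, P)     # the one added line: transform before the optimizer
    params = adam_step(params, grad, state, lr=0.05)
```

The base optimizer and its hyperparameters are left untouched; only the gradient passed into it changes. Setting `P = 1.0` makes `grad_power` the identity, which recovers plain Adam in this sketch.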
