AdamS: Momentum Itself Can Be A Normalizer for LLM Pretraining and Post-training

Huishuai Zhang; Bohan Wang; Luoxin Chen

arXiv:2505.16363·cs.LG·May 23, 2025

TL;DR

AdamS is a novel optimizer for large language models that uses a new normalization technique based on momentum, offering efficiency, simplicity, and improved performance over AdamW without requiring architectural changes.

Contribution

Introducing AdamS, an optimizer that replaces second-moment estimates with a momentum-based normalization, providing theoretical guarantees and practical benefits for LLM training.

Findings

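01. AdamS matches the memory and compute footprint of SGD with momentum.

02. It outperforms AdamW in LLM pretraining tasks.

03. It inherits AdamW hyperparameters and integrates into existing training pipelines without changes to optimizer APIs or model architectures.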

Abstract

We introduce AdamS, a simple yet effective alternative to Adam for large language model (LLM) pretraining and post-training. By leveraging a novel denominator, i.e., the root of weighted sum of squares of the momentum and the current gradient, AdamS eliminates the need for second-moment estimates. Hence, AdamS is efficient, matching the memory and compute footprint of SGD with momentum while delivering superior optimization performance. Moreover, AdamS is easy to adopt: it can directly inherit hyperparameters of AdamW, and is entirely model-agnostic, integrating seamlessly into existing pipelines without modifications to optimizer APIs or architectures. The motivation behind AdamS stems from the observed $(L_{0}, L_{1})$ smoothness properties in transformer objectives, where local smoothness is governed by gradient magnitudes that can be further approximated by momentum magnitudes. We…
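The update described in the abstract can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the abstract only says the denominator is the root of a weighted sum of squares of the momentum and the current gradient, so the specific weights (`beta2` on the squared previous momentum, `1 - beta2` on the squared gradient), the default values, the `eps` term, and the AdamW-style decoupled weight decay are all assumptions. The function name `adams_step` is hypothetical.

```python
import math


def adams_step(params, grads, momentum, lr=1e-3, beta1=0.9, beta2=0.95,
               eps=1e-8, weight_decay=0.0):
    """One hypothetical AdamS-style update on flat lists of floats.

    Returns (new_params, new_momentum). Only one state buffer (the
    momentum) is kept per parameter, as in SGD with momentum; no
    second-moment buffer is stored.
    """
    new_params, new_momentum = [], []
    for p, g, m_prev in zip(params, grads, momentum):
        # Assumed weighting: beta2 on the squared previous momentum,
        # (1 - beta2) on the squared current gradient.
        denom = math.sqrt(beta2 * m_prev * m_prev + (1.0 - beta2) * g * g) + eps
        # Standard exponential moving average of the gradient.
        m = beta1 * m_prev + (1.0 - beta1) * g
        # Decoupled weight decay in the style of AdamW (assumption).
        p = p * (1.0 - lr * weight_decay)
        new_params.append(p - lr * m / denom)
        new_momentum.append(m)
    return new_params, new_momentum


# Usage: a few steps on f(x) = x^2, whose gradient is 2x.
x, m = [1.0], [0.0]
for _ in range(5):
    x, m = adams_step(x, [2.0 * x[0]], m, lr=0.1)
```

Because the normalizer is recomputed from the momentum buffer and the fresh gradient at every step, the only persistent optimizer state in this sketch is `momentum`, which is what the abstract's comparison to the footprint of SGD with momentum refers to.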

Code & Models

Repositories

pku-huzhang/AdamS (PyTorch, official)

Videos

AdamS: Momentum Itself Can Be A Normalizer for LLM Pretraining and Post-training · underline

Taxonomy

Topics: Machine Learning and Data Classification · Topic Modeling · Artificial Intelligence in Healthcare and Education

Methods: Attention Is All You Need · Cosine Annealing · Linear Warmup With Cosine Annealing · Attention Dropout · Softmax · Weight Decay · Dropout · Linear Layer · Residual Connection