Do We Need Adam? Surprisingly Strong and Sparse Reinforcement Learning with SGD in LLMs

Sagnik Mukherjee; Lifan Yuan; Pavan Jayasinha; Dilek Hakkani-Tür; Hao Peng

arXiv:2602.07729 · cs.LG · February 25, 2026

TL;DR

This paper shows that stochastic gradient descent (SGD) can match or outperform AdamW in reinforcement learning for large language models while updating only a tiny fraction of parameters, challenging common optimization practice.

Contribution

It demonstrates that SGD, a simpler and substantially more memory-efficient optimizer, matches or outperforms AdamW for RL in LLMs, and that its parameter updates are highly sparse.

Findings

01

SGD matches or outperforms AdamW in RL training of LLMs.

02

Full fine-tuning with SGD updates fewer than 0.02% of parameters.

03

RL training benefits less from Adam's momentum and per-parameter adaptive learning rates than supervised fine-tuning does.
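The sparsity figure in finding 02 can be made concrete with a small sketch. It assumes that "updated" means a parameter's stored value differs between a snapshot taken before training and one taken after; the paper's exact definition is not given on this page, so the function name `update_fraction`, the `tol` parameter, and the toy data are all hypothetical.

```python
def update_fraction(before, after, tol=0.0):
    """Return the fraction of parameters whose value changed by more than `tol`.

    `before` and `after` are flat sequences of parameter values taken from the
    same model at two points in training (an assumption for illustration).
    """
    if len(before) != len(after):
        raise ValueError("snapshots must have the same number of parameters")
    if not before:
        return 0.0
    changed = sum(1 for b, a in zip(before, after) if abs(a - b) > tol)
    return changed / len(before)


# Toy data, not results from the paper: 10,000 weights, one of which moves.
before = [0.0] * 10_000
after = list(before)
after[42] = 0.5
frac = update_fraction(before, after)
print(f"{frac:.4%} of parameters changed")
```

On a real model the same count would be taken over every weight tensor; a threshold of 0.02% corresponds to `frac < 0.0002`.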

Abstract

Reinforcement learning (RL), particularly RL from verifiable reward (RLVR), has become a crucial phase of training large language models (LLMs) and a key focus of current scaling efforts. However, optimization practices in RL largely follow those of next-token prediction stages (e.g., pretraining and supervised fine-tuning), despite fundamental differences between RL and these stages highlighted by recent work. One such practice is the use of the AdamW optimizer, which is widely adopted for training large-scale transformers despite its high memory overhead. Our analysis shows that both momentum and adaptive learning rates in AdamW are less influential in RL than in SFT, leading us to hypothesize that RL benefits less from Adam-style per-parameter adaptive learning rates and momentum. Confirming this hypothesis, our experiments demonstrate that the substantially more memory-efficient…
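The memory overhead mentioned in the abstract comes from optimizer state: AdamW keeps two extra values per parameter (first and second moment estimates), SGD with momentum keeps one, and plain SGD keeps none. The arithmetic below is a rough sketch only. The 7-billion-parameter model size and the 4-byte (fp32) state precision are hypothetical inputs chosen for illustration, not figures from the paper, and `optimizer_state_bytes` is a made-up helper.

```python
# Number of per-parameter state buffers each optimizer stores.
STATE_BUFFERS = {
    "sgd": 0,           # plain SGD: no extra state
    "sgd_momentum": 1,  # one momentum buffer
    "adamw": 2,         # first and second moment estimates
}


def optimizer_state_bytes(num_params, optimizer, bytes_per_value=4):
    """Estimate optimizer state size in bytes (excludes weights and gradients)."""
    if optimizer not in STATE_BUFFERS:
        raise ValueError(f"unknown optimizer: {optimizer}")
    return num_params * STATE_BUFFERS[optimizer] * bytes_per_value


# Hypothetical model size, assumed for illustration only.
num_params = 7_000_000_000
for name in STATE_BUFFERS:
    gigabytes = optimizer_state_bytes(num_params, name) / 1e9
    print(f"{name}: {gigabytes:.0f} GB of optimizer state")
```

Under these assumptions the AdamW state alone is twice the size of an fp32 copy of the weights, which is the overhead that plain SGD avoids.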

Taxonomy

Topics: Topic Modeling · Artificial Intelligence in Healthcare and Education · Explainable Artificial Intelligence (XAI)