SAGE: Sign-Adaptive Gradient for Memory-Efficient LLM Optimization

Wooin Lee; Hyun-Tae Kim

arXiv:2604.07663·cs.LG·April 17, 2026

TL;DR

SAGE is an optimizer that cuts optimizer-state memory in large language model training by pairing a Lion-style update direction with a memory-efficient $O(d)$ adaptive scale, improving convergence and perplexity.

Contribution

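SAGE is a sign-adaptive, memory-efficient optimizer built for the sparse, high-variance gradients of embedding layers. It replaces AdamW in the hybrid setups that light-state optimizers such as SinkGD otherwise fall back on, and the authors report that it outperforms existing methods in LLM training.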

Findings

01

SAGE achieves state-of-the-art perplexity on Llama models up to 1.3B parameters.

02

SAGE significantly reduces optimizer state memory compared to AdamW.

03

SAGE demonstrates better convergence stability than existing light-state optimizers.
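To put the memory claim in concrete terms, the short calculation below compares the optimizer-state footprint of AdamW (two moment buffers per parameter, which is what "twice the model's size" refers to) with that of any optimizer keeping a single momentum buffer per parameter. It is a back-of-the-envelope sketch, not a measurement from the paper: the fp32 state precision and the helper name `optimizer_state_bytes` are assumptions, and the figures say nothing about SAGE's actual reported footprint.

```python
def optimizer_state_bytes(num_params, states_per_param, bytes_per_value=4):
    """Bytes of optimizer state, assuming every state value uses the same dtype.

    bytes_per_value=4 assumes fp32 states (an assumption, not from the paper).
    """
    return num_params * states_per_param * bytes_per_value


# 1.3B parameters is the largest model size mentioned in the findings.
NUM_PARAMS = 1_300_000_000

# AdamW stores a first and a second moment for every parameter.
adamw_bytes = optimizer_state_bytes(NUM_PARAMS, states_per_param=2)

# A Lion-style optimizer stores only one momentum buffer per parameter.
single_state_bytes = optimizer_state_bytes(NUM_PARAMS, states_per_param=1)

print(f"AdamW states:        {adamw_bytes / 1e9:.1f} GB")
print(f"Single-state update: {single_state_bytes / 1e9:.1f} GB")
```

Under these assumptions the single-buffer design halves the optimizer state; any extra $O(d)$ scale vector is negligible next to a per-parameter buffer.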

Abstract

The AdamW optimizer, while standard for LLM pretraining, is a critical memory bottleneck, consuming optimizer states equivalent to twice the model's size. Although light-state optimizers like SinkGD attempt to address this issue, we identify the embedding layer dilemma: these methods fail to handle the sparse, high-variance gradients inherent to embeddings, forcing a hybrid design that reverts to AdamW and partially negates the memory gains. We propose SAGE (Sign Adaptive GradiEnt), a novel optimizer that resolves this dilemma by replacing AdamW in this hybrid structure. SAGE combines a Lion-style update direction with a new, memory-efficient $O(d)$ adaptive scale. This scale acts as a "safe damper," provably bounded by 1.0, which tames high-variance dimensions more effectively than existing methods. This superior stability allows SAGE to achieve better convergence. On Llama models up…
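The abstract does not give SAGE's formula, so the sketch below only illustrates the general shape it describes: a Lion-style sign update multiplied by a per-dimension scale that never exceeds 1. The Lion part (interpolate momentum and gradient, take the sign, then update the momentum) follows the published Lion rule. The scale is a hypothetical stand-in chosen for illustration, namely the ratio of mean absolute gradient to RMS gradient per embedding column, which is at most 1 and shrinks for sparse, spiky columns. The names `column_damper` and `lion_style_step` are likewise invented here and are not the authors' method.

```python
import math


def sign(x):
    """Return -1, 0, or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


def column_damper(grad, eps=1e-12):
    """Hypothetical O(d) scale: one value per column of a (rows x d) gradient.

    Uses mean|g| / rms(g), which lies in [0, 1]. NOT the scale from the paper.
    """
    rows, cols = len(grad), len(grad[0])
    scale = []
    for j in range(cols):
        col = [grad[i][j] for i in range(rows)]
        mean_abs = sum(abs(g) for g in col) / rows
        rms = math.sqrt(sum(g * g for g in col) / rows)
        scale.append(min(1.0, mean_abs / (rms + eps)))
    return scale


def lion_style_step(param, grad, mom, lr=1e-4, beta1=0.9, beta2=0.99,
                    weight_decay=0.0):
    """One in-place Lion-style step with the hypothetical per-column damper."""
    scale = column_damper(grad)
    for i in range(len(param)):
        for j in range(len(param[0])):
            # Lion direction: sign of the interpolated momentum and gradient.
            c = beta1 * mom[i][j] + (1 - beta1) * grad[i][j]
            update = scale[j] * sign(c) + weight_decay * param[i][j]
            param[i][j] -= lr * update
            # Momentum is refreshed after the direction has been computed.
            mom[i][j] = beta2 * mom[i][j] + (1 - beta2) * grad[i][j]
    return scale


# Column 0 is dense; column 1 is sparse with a single large entry.
grad = [[1.0, 0.0], [-1.0, 0.0], [1.0, 0.0], [-1.0, 4.0]]
param = [[0.0, 0.0] for _ in range(4)]
mom = [[0.0, 0.0] for _ in range(4)]
scale = lion_style_step(param, grad, mom, lr=0.1)
print(scale)
```

In this toy input the dense column keeps a scale close to 1 while the sparse column is damped to about 0.5, which is the qualitative behavior the abstract attributes to a "safe damper". The only per-parameter state is the momentum buffer; the scale costs one value per embedding dimension.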
