SAGE: Sign-Adaptive Gradient for Memory-Efficient LLM Optimization
Wooin Lee, Hyun-Tae Kim

TL;DR
SAGE is an optimizer that cuts optimizer-state memory in large language model training by combining a Lion-style update direction with a memory-efficient adaptive scale, and it reports better convergence and perplexity.
Contribution
SAGE introduces a sign-adaptive, memory-efficient optimizer that handles the sparse, high-variance gradients of embedding layers, so it can replace AdamW in the hybrid designs that light-state optimizers otherwise fall back to.
Findings
SAGE achieves state-of-the-art perplexity on Llama models up to 1.3B parameters.
SAGE significantly reduces optimizer state memory compared to AdamW.
SAGE demonstrates better convergence stability than existing light-state optimizers.
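To make the memory comparison concrete, here is a rough back-of-the-envelope sketch. It relies only on the well-known fact that AdamW keeps two state tensors (first and second moment) per parameter. The function name, the fp32 assumption, and the "one state per parameter" comparison row are illustrative assumptions; the summary does not state SAGE's exact state footprint.

```python
def optimizer_state_bytes(num_params, states_per_param, bytes_per_value=4):
    """Bytes needed for optimizer state, assuming dense fp32 state tensors."""
    return num_params * states_per_param * bytes_per_value


NUM_PARAMS = 1_300_000_000  # 1.3B parameters, the largest size mentioned above

weights_bytes = NUM_PARAMS * 4                           # fp32 weights
adamw_bytes = optimizer_state_bytes(NUM_PARAMS, 2)       # first + second moment
one_state_bytes = optimizer_state_bytes(NUM_PARAMS, 1)   # hypothetical: momentum only

if __name__ == "__main__":
    gib = 1024 ** 3
    print(f"weights:          {weights_bytes / gib:.2f} GiB")
    print(f"AdamW state:      {adamw_bytes / gib:.2f} GiB")
    print(f"one-state sketch: {one_state_bytes / gib:.2f} GiB")
```

Under these assumptions the AdamW state is exactly twice the size of the weights, which is the bottleneck the paper targets.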
Abstract
The AdamW optimizer, while standard for LLM pretraining, is a critical memory bottleneck, consuming optimizer states equivalent to twice the model's size. Although light-state optimizers like SinkGD attempt to address this issue, we identify the embedding layer dilemma: these methods fail to handle the sparse, high-variance gradients inherent to embeddings, forcing a hybrid design that reverts to AdamW and partially negates the memory gains. We propose SAGE (Sign Adaptive GradiEnt), a novel optimizer that resolves this dilemma by replacing AdamW in this hybrid structure. SAGE combines a Lion-style update direction with a new, memory-efficient adaptive scale. This scale acts as a "safe damper," provably bounded by 1.0, which tames high-variance dimensions more effectively than existing methods. This superior stability allows SAGE to achieve better convergence. On Llama models up…
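As a rough illustration of the two ingredients the abstract names, the sketch below applies a Lion-style sign update and multiplies it by a per-coordinate scale that is clamped to at most 1.0. The Lion part follows the commonly published form (sign of an interpolated momentum), with weight decay omitted. The `scale` argument and the clamp are placeholders only: this summary does not give SAGE's actual scale formula, so the code should not be read as the authors' method.

```python
def sign(x):
    """Return -1, 0, or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


def lion_style_step(params, grads, momentum, lr=1e-4,
                    beta1=0.9, beta2=0.99, scale=None):
    """One Lion-style step with a hypothetical damper bounded by 1.0.

    `scale` is an illustrative stand-in for an adaptive scale; it is
    clamped to [0, 1] so it can only shrink a step, never enlarge it.
    """
    if scale is None:
        scale = [1.0] * len(params)  # plain Lion-style update, no damping
    new_params, new_momentum = [], []
    for p, g, m, s in zip(params, grads, momentum, scale):
        direction = sign(beta1 * m + (1 - beta1) * g)  # sign of interpolated momentum
        damper = min(max(s, 0.0), 1.0)                 # placeholder bound, not SAGE's formula
        new_params.append(p - lr * damper * direction)
        new_momentum.append(beta2 * m + (1 - beta2) * g)
    return new_params, new_momentum


if __name__ == "__main__":
    p, m = lion_style_step([1.0, -2.0], [0.5, -3.0], [0.0, 0.0],
                           lr=0.1, scale=[0.5, 2.0])
    print(p, m)
```

Because the damper never exceeds 1.0, each coordinate moves by at most `lr` per step regardless of gradient magnitude, which is the sense in which a bounded scale acts as a "safe damper".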
