Grams: Gradient Descent with Adaptive Momentum Scaling
Yang Cao, Xiaoyu Li, Zhao Song

TL;DR
Grams is a new optimization algorithm for deep learning that separates gradient direction from momentum-based magnitude scaling, leading to faster convergence and better generalization than existing methods.
Contribution
It introduces a novel optimizer that decouples update direction from magnitude scaling, with theoretical guarantees and superior empirical performance.
Findings
Faster convergence than Adam and Lion.
Better generalization in training large language models.
Theoretical proof of global convergence.
Abstract
We introduce Gradient Descent with Adaptive Momentum Scaling (Grams), a novel optimization algorithm that decouples the direction and magnitude of parameter updates in deep learning. Unlike traditional optimizers that directly integrate momentum into updates, Grams separates the update direction, derived from current gradients, from momentum, which is used solely for adaptive magnitude scaling. This approach enables Grams to achieve improved loss descent compared to state-of-the-art cautious and momentum-based optimizers. We theoretically demonstrate that Grams descends faster than other state-of-the-art optimizers and establish a global convergence guarantee for Grams. We also validate its effectiveness through extensive empirical evaluations. The results demonstrate Grams' superior performance, including faster convergence and…
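The decoupling described in the abstract can be illustrated with a minimal sketch. It assumes the magnitude comes from an Adam-style, bias-corrected momentum update and the direction from the sign of the current gradient; the function names (`grams_step`, `init_state`), the state layout, and the default hyperparameters are hypothetical choices for illustration and are not taken from the authors' code.

```python
import math


def sign(x):
    """Return -1, 0, or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


def init_state(n):
    """Hypothetical optimizer state: step count plus first and second moments."""
    return {"t": 0, "m": [0.0] * n, "v": [0.0] * n}


def grams_step(params, grads, state, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Grams-style step on flat lists of floats (illustrative sketch).

    Assumption: the step size is |Adam update|, and the step direction is
    the sign of the current gradient rather than the sign of the momentum.
    """
    state["t"] += 1
    t = state["t"]
    new_params = []
    for i, (w, g) in enumerate(zip(params, grads)):
        # Adam-style exponential moving averages of the gradient and its square.
        state["m"][i] = beta1 * state["m"][i] + (1 - beta1) * g
        state["v"][i] = beta2 * state["v"][i] + (1 - beta2) * g * g
        # Bias correction for the zero-initialized moments.
        m_hat = state["m"][i] / (1 - beta1 ** t)
        v_hat = state["v"][i] / (1 - beta2 ** t)
        adam_update = m_hat / (math.sqrt(v_hat) + eps)
        # Momentum only scales the step; the current gradient picks the direction.
        new_params.append(w - lr * sign(g) * abs(adam_update))
    return new_params


# Usage: a few steps on f(w) = w^2, whose gradient is 2w.
w = [1.0]
state = init_state(len(w))
for _ in range(5):
    w = grams_step(w, [2.0 * w[0]], state, lr=0.1)
```

In this sketch, when the momentum and the current gradient disagree in sign, the parameter still moves against the current gradient, which is where it differs from a plain Adam step.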
Taxonomy
Topics: Slime Mold and Myxomycetes Research
Methods: Adam · Evolved Sign Momentum
