On Surprising Effectiveness of Masking Updates in Adaptive Optimizers
Taejong Joo, Wenhan Xia, Cheolmin Kim, Ming Zhang, Eugene Ie

TL;DR
This paper shows that randomly masking parameter updates in adaptive optimizers can improve large language model training by inducing a beneficial regularization effect, and that the proposed Magma method improves performance further.
Contribution
The paper introduces a masked update technique that outperforms recent optimizers and presents Magma, a simple drop-in replacement for adaptive optimizers that gives consistent gains with negligible computational overhead.
Findings
A masked variant of RMSProp consistently outperforms recent state-of-the-art optimizers.
On the 1B model, Magma reduces perplexity by over 19% compared to Adam.
Masked updates induce curvature-dependent regularization.
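As a rough illustration of how a curvature-dependent term can arise from masking, consider a second-order expansion of the loss under a random mask. This is a sketch under stated assumptions, not necessarily the paper's own analysis: the mask $m \in \{0,1\}^d$ is assumed to have independent entries $m_i \sim \mathrm{Bernoulli}(p)$, $u$ is the dense update direction, $\eta$ the learning rate, and $g$ and $H$ the gradient and Hessian of the loss $L$ at $\theta$ (all symbols are introduced here for illustration).

```latex
\begin{align*}
\mathbb{E}_m\!\left[L(\theta - \eta\, m \odot u)\right]
  &\approx L(\theta) - \eta\, \mathbb{E}_m[m \odot u]^\top g
     + \frac{\eta^2}{2}\, \mathbb{E}_m\!\left[(m \odot u)^\top H\, (m \odot u)\right] \\
  &= L(\theta) - \eta p\, g^\top u
     + \frac{\eta^2}{2}\Big( p^2\, u^\top H u + p(1-p) \sum_{i=1}^{d} H_{ii}\, u_i^2 \Big),
\end{align*}
% using E[m_i m_j] = p^2 for i \neq j and E[m_i^2] = p.
```

Compared with a deterministic step of size $\eta p$, the masked step carries the extra term $\frac{\eta^2}{2}\, p(1-p) \sum_i H_{ii} u_i^2$, which penalizes movement along coordinates with large diagonal curvature.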
Abstract
Training large language models (LLMs) relies almost exclusively on dense adaptive optimizers with increasingly sophisticated preconditioners. We challenge this by showing that randomly masking parameter updates can be highly effective, with a masked variant of RMSProp consistently outperforming recent state-of-the-art optimizers. Our analysis reveals that the random masking induces a curvature-dependent geometric regularization that smooths the optimization trajectory. Motivated by this finding, we introduce Momentum-aligned gradient masking (Magma), which modulates the masked updates using momentum-gradient alignment. Extensive LLM pre-training experiments show that Magma is a simple drop-in replacement for adaptive optimizers with consistent gains and negligible computational overhead. Notably, for the 1B model size, Magma reduces perplexity by over 19% and 9% compared to Adam and…
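To make the idea of a masked update concrete, here is a minimal NumPy sketch of one RMSProp step in which only a random subset of coordinates is updated. Several details are assumptions made for illustration and are not taken from the abstract: the function name `masked_rmsprop_step`, the element-wise Bernoulli mask with probability `keep_prob`, the dense update of the second-moment estimate, and the absence of any rescaling of the kept coordinates. Magma's alignment-based modulation is not shown because the abstract does not specify it.

```python
import numpy as np


def masked_rmsprop_step(param, grad, sq_avg, rng,
                        lr=1e-3, beta2=0.99, eps=1e-8, keep_prob=0.5):
    """One RMSProp step applied to a random subset of coordinates.

    Assumptions (illustrative only): element-wise Bernoulli mask,
    second-moment estimate updated for every coordinate.
    """
    # Standard RMSProp running average of squared gradients.
    sq_avg = beta2 * sq_avg + (1.0 - beta2) * grad ** 2
    # Dense preconditioned update direction.
    update = grad / (np.sqrt(sq_avg) + eps)
    # Keep each coordinate independently with probability keep_prob.
    mask = rng.random(param.shape) < keep_prob
    # Masked coordinates (mask == False) are left unchanged this step.
    param = param - lr * mask * update
    return param, sq_avg, mask


# Toy usage: minimize 0.5 * ||x||^2, whose gradient is x itself.
rng = np.random.default_rng(0)
x = np.ones(8)
v = np.zeros(8)
for _ in range(100):
    x, v, _ = masked_rmsprop_step(x, x, v, rng, lr=1e-2)
```

Setting `keep_prob=1.0` recovers the usual dense RMSProp step, which makes it easy to compare the two variants on the same problem.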
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Machine Learning and Data Classification · Topic Modeling
