Omni-Masked Gradient Descent: Memory-Efficient Optimization via Mask Traversal with Improved Convergence
Hui Yang, Tao Ren, Jinyang Jiang, Wan Tian, Yijie Peng

TL;DR
Omni-Masked Gradient Descent (OMGD) is a memory-efficient optimization method for large language models that comes with an improved convergence guarantee and yields consistent empirical gains.
Contribution
The paper introduces OMGD, a novel mask traversal-based optimizer with proven nonconvex convergence guarantees and practical improvements over existing methods.
Findings
Achieves an improved iteration complexity of order ε^{-3} for finding approximate stationary points
Integrates seamlessly with mainstream optimizers and enhances performance
Demonstrates consistent empirical improvements in fine-tuning and pre-training
Abstract
Memory-efficient optimization methods have recently gained increasing attention for scaling full-parameter training of large language models under the GPU-memory bottleneck. Existing approaches either lack clear convergence guarantees or achieve only the standard iteration complexity in the nonconvex setting. We propose Omni-Masked Gradient Descent (OMGD), an optimization method based on mask traversal for memory-efficient training, and provide a nonconvex convergence analysis that establishes a strictly improved iteration complexity of order ε^{-3} for finding an ε-approximate stationary point. Empirically, OMGD is a lightweight, plug-and-play approach that integrates seamlessly into most mainstream optimizers, yielding consistent improvements over competitive baselines in both fine-tuning and pre-training tasks.
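The abstract does not spell out the update rule, so the following is only a toy sketch of what "mask traversal" could look like, not the authors' algorithm. It assumes the coordinates are split into disjoint blocks (masks) that together cover every parameter, and that each cycle visits every block exactly once in a freshly shuffled order. The names `masked_gd`, `quad_grad`, and `num_blocks` are hypothetical, and plain gradient descent stands in for whichever base optimizer OMGD would wrap.

```python
import random

def masked_gd(grad, x0, num_blocks=4, lr=0.1, cycles=50, seed=0):
    """Toy masked gradient descent (illustrative assumption, not the paper's method).

    Each step updates only the coordinates selected by one mask; every cycle
    traverses all masks once, in a shuffled order.
    """
    rng = random.Random(seed)
    x = list(x0)
    n = len(x)
    # Disjoint index blocks that jointly cover every coordinate.
    blocks = [list(range(b, n, num_blocks)) for b in range(num_blocks)]
    for _ in range(cycles):
        order = list(range(num_blocks))
        rng.shuffle(order)  # new traversal order for this cycle
        for b in order:
            # For simplicity the full gradient is computed here; a real
            # memory-efficient variant would only keep state for the active block.
            g = grad(x)
            for i in blocks[b]:
                x[i] -= lr * g[i]
    return x

def quad_grad(x):
    # Gradient of f(x) = sum_i (x_i - 1)^2, minimized at x_i = 1.
    return [2.0 * (xi - 1.0) for xi in x]

x_final = masked_gd(quad_grad, [0.0] * 8)
```

On this separable quadratic every coordinate is updated once per cycle, so the iterate contracts toward the all-ones minimizer. The sketch says nothing about the convergence rate claimed in the paper.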
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Advanced Neural Network Applications · Tensor decomposition and applications
