Omni-Masked Gradient Descent: Memory-Efficient Optimization via Mask Traversal with Improved Convergence

Hui Yang; Tao Ren; Jinyang Jiang; Wan Tian; Yijie Peng

arXiv:2603.05960 · cs.LG · March 11, 2026

TL;DR

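Omni-Masked Gradient Descent (OMGD) is a memory-efficient optimization method, based on mask traversal, for full-parameter training of large language models. It comes with an improved nonconvex iteration complexity of $\tilde{O}(ϵ^{-3})$ and yields consistent empirical improvements over competitive baselines in both fine-tuning and pre-training.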

Contribution

The paper introduces OMGD, a mask-traversal-based optimizer, together with a nonconvex convergence analysis that improves on the standard $O(ϵ^{-4})$ iteration complexity of existing memory-efficient methods.

Findings

01

Achieves an improved iteration complexity of $\tilde{O}(ϵ^{-3})$ for finding an $ϵ$-approximate stationary point in the nonconvex setting

02

Works as a lightweight, plug-and-play approach that integrates into most mainstream optimizers

03

Yields consistent empirical improvements over competitive baselines in both fine-tuning and pre-training tasks

Abstract

Memory-efficient optimization methods have recently gained increasing attention for scaling full-parameter training of large language models under the GPU-memory bottleneck. Existing approaches either lack clear convergence guarantees or achieve only the standard $O(ϵ^{-4})$ iteration complexity in the nonconvex setting. We propose Omni-Masked Gradient Descent (OMGD), an optimization method based on mask traversal for memory-efficient training, and provide a nonconvex convergence analysis that establishes a strictly improved iteration complexity of $\tilde{O}(ϵ^{-3})$ for finding an $ϵ$-approximate stationary point. Empirically, OMGD is a lightweight, plug-and-play approach that integrates seamlessly into most mainstream optimizers, yielding consistent improvements over competitive baselines in both fine-tuning and pre-training tasks.
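
The abstract does not spell out the update rule, so the following is only a minimal sketch of what a mask-traversal scheme could look like, not the authors' algorithm. It assumes that the coordinates are split into disjoint blocks (the "masks"), that every mask is visited exactly once per cycle, and that only the coordinates under the active mask are updated. The names `make_masks` and `masked_gd`, the random partition, and the toy quadratic objective are all hypothetical choices made for illustration.

```python
import random


def make_masks(dim, num_blocks, rng):
    """Randomly partition the indices 0..dim-1 into disjoint blocks (assumed mask layout)."""
    idx = list(range(dim))
    rng.shuffle(idx)
    return [sorted(idx[b::num_blocks]) for b in range(num_blocks)]


def masked_gd(grad_fn, x0, lr, num_blocks, num_cycles, seed=0):
    """Gradient descent that updates one block of coordinates per step.

    Within each cycle the masks are traversed without repetition, so every
    coordinate is updated exactly once per cycle. For simplicity grad_fn
    returns the full gradient; a memory-efficient implementation would only
    materialize the entries (and optimizer state) under the active mask.
    """
    rng = random.Random(seed)
    x = list(x0)
    for _ in range(num_cycles):
        masks = make_masks(len(x), num_blocks, rng)  # fresh partition each cycle
        for mask in masks:
            g = grad_fn(x)
            for i in mask:
                x[i] -= lr * g[i]
    return x


# Toy separable quadratic f(x) = 0.5 * sum(a_i * x_i^2), with gradient a_i * x_i.
A = [1.0, 2.0, 3.0, 4.0]


def quad_grad(x):
    return [a * xi for a, xi in zip(A, x)]


x_final = masked_gd(quad_grad, x0=[1.0] * 4, lr=0.1, num_blocks=2, num_cycles=50)
```

On this separable toy problem each coordinate is scaled by $(1 - 0.1\,a_i)$ once per cycle regardless of which mask it falls under, which makes the traversal property easy to check by hand. The sketch says nothing about the $\tilde{O}(ϵ^{-3})$ rate, which depends on details of the analysis that the abstract does not give.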

Taxonomy

Topics: Stochastic Gradient Optimization Techniques · Advanced Neural Network Applications · Tensor decomposition and applications