Leveraging Coordinate Momentum in SignSGD and Muon: Memory-Optimized Zero-Order

Egor Petrov; Grigoriy Evseev; Aleksey Antonov; Andrey Veprikov; Nikolay Bushkov; Stanislav Moiseev; Aleksandr Beznosikov

arXiv:2506.04430·cs.LG·October 17, 2025

TL;DR

This paper introduces memory-efficient zero-order optimization algorithms, JAGUAR SignSGD and JAGUAR Muon, for fine-tuning large language models, providing theoretical guarantees and competitive empirical results with reduced resource requirements.

Contribution

It presents the first convergence guarantees for SignSGD in stochastic zero-order optimization and introduces novel ZO algorithms leveraging model structure for LLM fine-tuning.

Findings

01 Achieves comparable or better convergence than first-order methods.

02 Significantly reduces memory usage during LLM fine-tuning.

03 Provides theoretical convergence rates for the proposed ZO algorithms.

Abstract

Fine-tuning Large Language Models (LLMs) is essential for adapting pre-trained models to downstream tasks. Yet traditional first-order optimizers such as Stochastic Gradient Descent (SGD) and Adam incur prohibitive memory and computational costs that scale poorly with model size. In this paper, we investigate zero-order (ZO) optimization methods as a memory- and compute-efficient alternative, particularly in the context of parameter-efficient fine-tuning techniques like LoRA. We propose JAGUAR SignSGD, a ZO momentum-based algorithm that extends ZO SignSGD, requiring the same number of parameters as the standard ZO SGD and only $O(1)$ function evaluations per iteration. To the best of our knowledge, this is the first study to establish rigorous convergence guarantees for SignSGD in the stochastic ZO case. We further propose JAGUAR Muon, a novel ZO…
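The abstract does not spell out the update rule, so the following is only a rough sketch of what a ZO sign update with coordinate-wise momentum could look like. It assumes a two-point central finite difference along one randomly chosen coordinate, an exponential-moving-average momentum that is refreshed only at that coordinate, and a sign step on the whole momentum vector. The function name `zo_sign_momentum_step`, the parameters `lr`, `beta`, and `tau`, and the toy objective `quadratic` are all hypothetical and are not taken from the paper or its repository.

```python
import random


def sign(v):
    # Returns -1, 0, or 1 depending on the sign of v.
    return (v > 0) - (v < 0)


def zo_sign_momentum_step(f, x, m, lr, beta, tau, rng):
    """One illustrative step; an assumed form, not the authors' implementation.

    f    : objective taking a list of floats
    x    : current parameters (not modified; a new list is returned)
    m    : momentum buffer, same length as x (updated in place)
    lr   : step size, beta : momentum factor, tau : finite-difference radius
    """
    i = rng.randrange(len(x))  # pick one coordinate uniformly at random
    x_plus = list(x)
    x_plus[i] += tau
    x_minus = list(x)
    x_minus[i] -= tau
    # Two function evaluations per step, independent of the dimension.
    g_i = (f(x_plus) - f(x_minus)) / (2 * tau)
    # Only coordinate i of the momentum is refreshed; the rest is reused.
    m[i] = beta * m[i] + (1 - beta) * g_i
    # Sign step on the full momentum vector.
    return [xj - lr * sign(mj) for xj, mj in zip(x, m)]


def quadratic(x):
    # Toy objective used only for this demonstration.
    return sum(v * v for v in x)


rng = random.Random(0)
x = [1.0, -2.0, 0.5]
m = [0.0] * len(x)
start = quadratic(x)
for _ in range(300):
    x = zo_sign_momentum_step(quadratic, x, m, lr=0.01, beta=0.9, tau=1e-3, rng=rng)
final = quadratic(x)
```

In this sketch the only optimizer state is the buffer `m`, which has the same size as the parameters, and each step costs two evaluations of `f` regardless of dimension. That is one way to read the abstract's memory and $O(1)$ evaluation claims, but the paper's actual algorithm may differ in its estimator, sampling scheme, or step-size schedule.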

Code & Models

Repositories

brain-mmo-lab/zo_llm (PyTorch, official)

Taxonomy

Methods: Stochastic Gradient Descent · Adam