TL;DR
This paper introduces memory-efficient zero-order optimization algorithms, JAGUAR SignSGD and JAGUAR Muon, for fine-tuning large language models, providing theoretical guarantees and competitive empirical results with reduced resource requirements.
Contribution
It presents the first convergence guarantees for SignSGD in stochastic zero-order optimization and introduces novel ZO algorithms leveraging model structure for LLM fine-tuning.
Findings
Achieves convergence comparable to or better than first-order methods.
Significantly reduces memory usage during LLM fine-tuning.
Provides theoretical convergence rates for the proposed ZO algorithms.
Abstract
Fine-tuning Large Language Models (LLMs) is essential for adapting pre-trained models to downstream tasks. Yet traditional first-order optimizers such as Stochastic Gradient Descent (SGD) and Adam incur prohibitive memory and computational costs that scale poorly with model size. In this paper, we investigate zero-order (ZO) optimization methods as a memory- and compute-efficient alternative, particularly in the context of parameter-efficient fine-tuning techniques like LoRA. We propose JAGUAR SignSGD, a ZO momentum-based algorithm that extends ZO SignSGD, requiring the same number of parameters as the standard ZO SGD and only O(1) function evaluations per iteration. To the best of our knowledge, this is the first study to establish rigorous convergence guarantees for SignSGD in the stochastic ZO case. We further propose JAGUAR Muon, a novel ZO…
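To make the zero-order idea in the abstract concrete, here is a minimal sketch of a generic sign-based ZO step with momentum on a toy objective. It is not the paper's JAGUAR SignSGD or JAGUAR Muon, whose exact update rules are not given in this summary. The function name `zo_sign_momentum_step`, the two-point finite-difference estimator, the random unit direction, and all hyperparameter values (`lr`, `tau`, `beta`) are assumptions chosen for illustration.

```python
import numpy as np


def zo_sign_momentum_step(f, x, m, lr=1e-2, tau=1e-3, beta=0.9, rng=None):
    """One generic zero-order sign step with momentum (illustrative only).

    f    : objective, called twice per step (no backpropagation needed)
    x    : current parameters, 1-D array
    m    : momentum buffer, same shape as x
    tau  : finite-difference smoothing radius (assumed value)
    """
    rng = np.random.default_rng() if rng is None else rng
    # Sample a random direction on the unit sphere.
    e = rng.standard_normal(x.shape)
    e /= np.linalg.norm(e)
    # Two-point estimate of the directional derivative, scaled along e.
    g = (f(x + tau * e) - f(x - tau * e)) / (2.0 * tau) * e
    # Exponential moving average of the noisy gradient estimates.
    m = beta * m + (1.0 - beta) * g
    # Sign update: every coordinate moves by exactly lr.
    x = x - lr * np.sign(m)
    return x, m


def quadratic(x):
    """Toy objective standing in for a fine-tuning loss."""
    return float(np.sum(x ** 2))


rng = np.random.default_rng(0)
x = np.ones(10)
m = np.zeros_like(x)
start = quadratic(x)
for _ in range(500):
    x, m = zo_sign_momentum_step(quadratic, x, m, rng=rng)
final = quadratic(x)
print(f"loss before: {start:.4f}, after: {final:.4f}")
```

Each step uses two evaluations of `f` regardless of dimension and needs no gradients or activations from a backward pass, which is the source of the memory savings that ZO methods target.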
Taxonomy
Methods: Stochastic Gradient Descent · Adam
