Zero-Order Optimization for LLM Fine-Tuning via Learnable Direction Sampling
Valery Parfenov, Grigoriy Evseev, Andrey Veprikov, Nikolay Bushkov, Stanislav Moiseev, Aleksandr Beznosikov

TL;DR
This paper introduces a learnable sampling policy for zero-order optimization in large language model fine-tuning, reducing the variance of gradient estimates and enabling scalable, memory-efficient training.
Contribution
It proposes a novel policy-driven zero-order framework with theoretical analysis and practical algorithms that improve gradient estimates for large-scale NLP models.
Findings
Enhanced fine-tuning performance on LLM benchmarks
Reduced variance in gradient estimation
Relaxed dependence on parameter dimensionality
Abstract
Fine-tuning large pretrained language models (LLMs) is a cornerstone of modern NLP, yet its growing memory demands (driven by backpropagation and large optimizer states) limit deployment in resource-constrained settings. Zero-order (ZO) methods bypass backpropagation by estimating directional derivatives from forward evaluations, offering substantial memory savings. However, classical ZO estimators suffer from high variance and an adverse dependence on the parameter dimensionality, which has constrained their use to low-dimensional problems. In this work, we propose a policy-driven ZO framework that treats the sampling distribution over perturbation directions as a learnable policy and updates it to reduce the variance of directional estimates. We develop a practical algorithm implementing this idea and provide a theoretical analysis, showing that learned sampling distributions…
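For readers unfamiliar with the classical baseline the abstract refers to, the following is a minimal sketch of a standard two-point ZO gradient estimator with isotropic Gaussian directions. It is not the paper's learned-policy method. The names `zo_gradient`, `quadratic`, `tau`, and `num_dirs` are illustrative assumptions, and the toy quadratic loss stands in for a model's forward pass.

```python
import numpy as np

def zo_gradient(loss, theta, rng, num_dirs=1, tau=1e-3):
    """Classical two-point zero-order gradient estimate.

    Uses only forward evaluations of `loss`; no backpropagation.
    Directions are drawn from a fixed isotropic Gaussian (an assumption
    of this sketch, not the learned policy described in the abstract).
    """
    grad = np.zeros_like(theta)
    for _ in range(num_dirs):
        e = rng.standard_normal(theta.shape)  # random perturbation direction
        # Central finite difference approximates the directional derivative <grad, e>.
        d = (loss(theta + tau * e) - loss(theta - tau * e)) / (2.0 * tau)
        grad += d * e  # project the scalar estimate back along the direction
    return grad / num_dirs

def quadratic(theta):
    # Hypothetical toy loss whose true gradient is theta itself.
    return 0.5 * float(theta @ theta)

rng = np.random.default_rng(0)
theta = np.ones(10)
estimate = zo_gradient(quadratic, theta, rng, num_dirs=20000)
```

Averaging over many directions brings `estimate` close to the true gradient `theta` in this toy case. With a single Gaussian direction, the per-coordinate variance of the estimate grows with the number of parameters, which is the dimensionality dependence the abstract describes.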
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Domain Adaptation and Few-Shot Learning · Generative Adversarial Networks and Image Synthesis
