TL;DR
VAMO is a novel variance-reduced zeroth-order optimizer that achieves faster convergence and lower memory usage for large-scale nonconvex optimization, outperforming traditional first- and zeroth-order methods.
Contribution
We introduce VAMO, a hybrid variance-reduced zeroth-order optimizer with a dimension-agnostic convergence rate, improving efficiency and memory footprint over existing methods.
Findings
VAMO outperforms existing FO and ZO methods in neural network training.
VAMO achieves a convergence rate of O(1/T + 1/b), surpassing traditional ZO and SGD rates.
VAMO requires less dynamic memory, making it suitable for edge deployment.
Abstract
Optimizing large-scale nonconvex problems, common in deep learning, demands balancing rapid convergence with computational efficiency. First-order (FO) optimizers, which serve as today's baselines, provide fast convergence and good generalization but often incur high computation and memory costs due to the large size of modern models. Conversely, zeroth-order (ZO) algorithms reduce this burden using estimated gradients, yet their slow convergence in high-dimensional settings limits practicality. We introduce VAMO (VAriance-reduced Mixed-gradient Optimizer), a stochastic variance-reduced method that extends mini-batch SGD with full-batch ZO gradients under an SVRG-style framework. VAMO's hybrid design utilizes a two-point ZO estimator to achieve a dimension-agnostic convergence rate of $\mathcal{O}(1/T + 1/b)$, where $T$ is the number of iterations and $b$ is the batch size, surpassing…
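The hybrid design described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the least-squares toy problem, the Gaussian directions, the smoothing radius `mu`, the sign of the correction term, and every function name below (`zo_grad`, `hybrid_estimator`, and so on) are assumptions made for illustration. The sketch combines an exact mini-batch FO gradient at the current iterate with a ZO control variate evaluated at a snapshot point, in the usual SVRG pattern.

```python
import numpy as np


def make_problem(n=32, d=5, seed=0):
    """Hypothetical toy least-squares data: f_i(x) = 0.5 * (a_i . x - y_i)^2."""
    rng = np.random.default_rng(seed)
    return rng.normal(size=(n, d)), rng.normal(size=n)


def batch_loss(A, y, idx):
    """Return the average loss over the samples in `idx` as a function of x."""
    return lambda x: 0.5 * np.mean((A[idx] @ x - y[idx]) ** 2)


def batch_grad(A, y, idx, x):
    """Exact first-order gradient of the average loss over `idx`."""
    residual = A[idx] @ x - y[idx]
    return A[idx].T @ residual / len(idx)


def zo_grad(loss, x, dirs, mu=1e-3):
    """Two-point ZO estimate using forward evaluations only.

    Averages (loss(x + mu*u) - loss(x)) / mu * u over the rows u of `dirs`.
    """
    base = loss(x)
    g = np.zeros_like(x)
    for u in dirs:
        g += (loss(x + mu * u) - base) / mu * u
    return g / len(dirs)


def hybrid_estimator(A, y, idx, x, x_snap, zo_full_snap, dirs, alpha=0.5, mu=1e-3):
    """SVRG-style mixed estimator (sign convention is an assumption).

    FO mini-batch gradient at x, minus alpha times the ZO correction
    (mini-batch ZO estimate minus full-batch ZO estimate) at the snapshot.
    """
    zo_batch_snap = zo_grad(batch_loss(A, y, idx), x_snap, dirs, mu)
    return batch_grad(A, y, idx, x) - alpha * (zo_batch_snap - zo_full_snap)
```

In this toy setting the two-point estimator is linear in the loss, so when the same directions are shared between the mini-batch and full-batch ZO estimates, the correction term averages to zero over uniformly sampled batches. Averaging `hybrid_estimator` over all single-sample batches therefore recovers the full gradient at `x`, which is one way to sanity-check the zero-mean correction idea.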
Peer Reviews
Decision · ICLR 2026 Conference Withdrawn Submission
1. I like the discussion of memory: the paper argues VAMO’s snapshot uses forward-only ZO passes, so peak dynamic memory resembles ZO-SGD rather than FO-SVRG, and optimizer state is lighter than Adam/Adagrad. Included tables and a clear decomposition (weights / states / dynamics) are helpful. 2. The main bound removes the typical $d$ dependence of ZO methods and improves over FO-SGD’s $\mathcal{O}(1/\sqrt{T})$ in theory.
1. ZO methods inherently trade off performance, memory, and wall-clock time. The paper treats ZO snapshots as “cheap,” but a full-batch snapshot still requires multiple forward passes per direction over the entire dataset (or many mini-batches). In practice, if memory is the bottleneck then ZO can help. However, runtime can balloon unless the number of directions $q$ is tiny, yet shrinking $q$ raises estimator variance and hurts convergence. Therefore, without an accounting of function evaluations (FEs) an
1. VAMO achieves a fast, linear convergence rate of $\mathcal{O}(1/T)$, which is an improvement over the $\mathcal{O}(1/\sqrt{T})$ rate of standard SGD. 2. A key advantage is that its convergence rate is independent of the model's parameter dimension $d$. This allows it to overcome the "curse of dimensionality" that makes purely Zeroth-Order (ZO) methods impractical for large models. 3. This paper provides a strong theoretical guarantee. VAMO's gradient estimator is designed to be unbiased.
1. The experimental results demonstrate that VAMO achieves better performance than SGD, but it fails to match the convergence speed and training efficiency of the Adam optimizer during large-scale fine-tuning. 2. The theoretical analysis shows that VAMO's convergence rate, while faster than SGD's, includes an additional error term of $\mathcal{O}(1/b)$ that is not present in the purely First-Order FO-SVRG algorithm. 3. The paper introduces a multi-point variant to minimize the additional error t
1. Novel Hybrid Estimator: The proposed gradient estimator (Eq. 6) is novel. The insight to use a ZO-based correction term $\alpha(\hat{\nabla} f_{I_k}(\hat{x}) - \hat{\nabla} f(\hat{x}))$, which has zero expectation, is technically sound. This ensures the full estimator $v_k^s$ is an unbiased estimator of the true current gradient $\nabla f(x_k^s)$, which is an elegant property for the analysis. 2. Dimension-Independent ZO-Hybrid Rate: The theoretical analysis successfully breaks the dimension-dependency curse of pure ZO methods. Achieving a rate of O
Despite its theoretical novelty, the paper’s core claims about its practical advantages are based on a series of critical, and in some cases contradictory, flaws in the analysis of its computational and memory costs. 1. Fundamentally Misleading Convergence Claims: The paper repeatedly claims an $\mathcal{O}(1/T)$ rate, equating it with FO-SVRG (e.g., Abstract: "significantly improving over SGD’s... rate"; Conclusion: "achieving convergence performance similar to FO-SVRG"). This is a misrepresentation. The
- Combining FO and ZO in SVRG is an interesting idea, and the FO component improves the performance of the algorithm compared with other ZO methods.
- The presentation of the paper is easy to follow, with a detailed comparison with prior works.
- The proofs all seem correct.
I have several concerns below.
- I'm not sure I understand the memory analysis in Section 4.3, B.1 and Table 3.
- For VAMO, why is the memory for the optimizer states only $|x|$? The algorithm needs to store $\hat{\nabla}(x_{\text{cpt}})$ but also $x_{\text{cpt}}$ to compute the estimate $\hat{\nabla}(x_{\text{cpt}}, B)$ (see also lines 3-4 in Alg. 1). I don't understand how the memory can be reduced to only $|x|$. If my understanding is correct, I don't see an improvement in the memory for the propos
