Unified Theory of Adaptive Variance Reduction
Aleksandr Shestakov, Valery Parfenov, Aleksandr Beznosikov

TL;DR
This paper introduces a unified framework for variance reduction in stochastic optimization, including biased estimators and adaptive step sizes, eliminating the need for hyperparameter tuning and demonstrating broad applicability and effectiveness.
Contribution
It generalizes existing variance reduction methods to include biased estimators and proposes new adaptive step size algorithms that require no hyperparameter tuning.
Findings
Adaptive variance reduction methods outperform traditional ones.
The proposed methods are effective across various optimization tasks.
Numerical experiments confirm improved convergence and usability.
Abstract
Variance reduction is a family of powerful mechanisms for stochastic optimization that appears to be helpful in many machine learning tasks. It approximates the exact gradient with recursively updated estimator sequences. Previously, many papers demonstrated that methods with unbiased variance-reduction estimators can be described in a single framework. We generalize this approach and show that the unbiasedness assumption is excessive; hence, we include biased estimators in this analysis. The main contribution of our work, however, is a set of new variance reduction methods with adaptive step sizes that are adjusted throughout the algorithm's iterations and do not need hyperparameter tuning. Our analysis covers finite-sum problems, distributed optimization, and coordinate methods. Numerical experiments in various tasks validate the effectiveness of our methods.
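To make the two ingredients named in the abstract concrete (a recursive gradient estimator and a step size adjusted during the run), here is a minimal sketch. It is not the authors' algorithm: this page does not state the paper's exact step-size rule, so the sketch assumes a loopless SARAH/PAGE-style biased recursive estimator combined with an AdaGrad-norm-style rule, gamma_t = alpha / sqrt(sum of ||g_s||^2 for s <= t). The function names, the synthetic least-squares problem, and the values of `alpha` and `p` are all hypothetical choices made for illustration.

```python
import math
import random


def grad_i(x, a, b):
    """Gradient of one component loss 0.5 * (a.x - b)^2 with respect to x."""
    r = sum(aj * xj for aj, xj in zip(a, x)) - b
    return [r * aj for aj in a]


def full_grad(x, A, B):
    """Exact gradient of the finite-sum objective (mean over all samples)."""
    n = len(A)
    g = [0.0] * len(x)
    for a, b in zip(A, B):
        g = [gj + gij / n for gj, gij in zip(g, grad_i(x, a, b))]
    return g


def loss(x, A, B):
    """Mean of 0.5 * (a.x - b)^2 over all samples."""
    return sum(0.5 * (sum(aj * xj for aj, xj in zip(a, x)) - b) ** 2
               for a, b in zip(A, B)) / len(A)


def make_problem(n=50, d=5, seed=1):
    """Synthetic consistent least-squares problem (illustrative data only)."""
    rng = random.Random(seed)
    x_star = [rng.gauss(0, 1) for _ in range(d)]
    A = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(n)]
    B = [sum(aj * xj for aj, xj in zip(a, x_star)) for a in A]
    return A, B


def adaptive_recursive_vr(A, B, steps=2000, alpha=0.1, p=0.1, seed=0):
    """Assumed sketch: recursive (biased) estimator + AdaGrad-norm step size."""
    rng = random.Random(seed)
    x = [0.0] * len(A[0])
    g = full_grad(x, A, B)   # start from the exact gradient
    acc = 0.0                # running sum of squared estimator norms
    step_sizes, losses = [], [loss(x, A, B)]
    for _ in range(steps):
        acc += sum(gj * gj for gj in g)
        # Step size uses only observed quantities, no smoothness constant.
        gamma = alpha / math.sqrt(acc + 1e-12)
        x_new = [xj - gamma * gj for xj, gj in zip(x, g)]
        if rng.random() < p:
            # Occasional refresh with the exact gradient.
            g = full_grad(x_new, A, B)
        else:
            # Recursive update: add the change in one sampled component gradient.
            i = rng.randrange(len(A))
            g_new = grad_i(x_new, A[i], B[i])
            g_old = grad_i(x, A[i], B[i])
            g = [gj + gn - go for gj, gn, go in zip(g, g_new, g_old)]
        x = x_new
        step_sizes.append(gamma)
        losses.append(loss(x, A, B))
    return x, step_sizes, losses


A, B = make_problem()
x, step_sizes, losses = adaptive_recursive_vr(A, B)
print("initial loss:", losses[0])
print("final loss:", losses[-1])
print("last step size:", step_sizes[-1])
```

In this sketch the step size can only shrink, because the accumulated sum never decreases. The scale `alpha` is still chosen by the user, which is the kind of residual parameter one of the reviews below points out.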
Peer Reviews
Decision: Submitted to ICLR 2026
1. This paper proposes a unified theoretical framework that encompasses both unbiased and biased gradient estimators for variance-reduced stochastic methods. 2. A contribution of this paper is the development of adaptive step size schedules that eliminate the need for hyperparameter tuning. 3. It provides comprehensive convergence guarantees for three settings: non-convex optimization, the PL condition, and adaptive step sizes.
1. While Assumption 1 is the paper's theoretical centerpiece, it provides limited guidance on how to calibrate constants for new VR methods. 2. The adaptive step size relies on a hyperparameter, but the paper provides limited insight into its practical choice or theoretical impact. 3. The related work section (Section 2) mentions adaptive methods like STORM (Cutkosky & Orabona, 2019) and Prodigy (Mishchenko & Defazio, 2023) but fails to provide a direct, quantitative comparison of the proposed
1. The analysis encompasses biased and unbiased estimators across finite-sum, distributed, and coordinate methods, filling gaps left by prior unified frameworks. 2. The proposed step-size rule depends only on observable quantities, removes reliance on L or µ, and achieves optimal O(1/√T) nonconvex rates and linear PL rates. 3. Ablations over batch sizes, probabilities, compression levels, clients, and coordinate sketch sizes indicate the adaptive variants consistently outperform tuned constant-s
1. Evaluation centers on a single dataset (a9a) and a simple logistic-regression task, with no large-scale nonconvex/deep-learning benchmarks, non-iid federated settings, or real distributed systems. 2. Although the schedule is “parameter-free,” users still choose α; constants (A, B, C, ρ) governing bounds are estimator-specific; links between theoretical and deployable code are not fully operationalized. 3. Claims about extending unified analyses to biased estimators and being first to provide
1. The proposed Assumption 1 captures the recursive behavior of variance-reduced estimators across diverse settings, allowing inclusion of a wider range of methods. 2. The adaptive step sizes, based on accumulated gradient norms, avoid dependence on unknown constants, making them practical for real-world applications. Asymptotically optimal rates are preserved, and theorems provide clear bounds. 3. The work unifies finite-sum, distributed (with compression to reduce communication), and coordinat
1. In my view, the paper’s main weakness is that its conclusions are not new. Prior adaptive methods [1,2] have already achieved the optimal convergence rate for finite-sum problems, and this work does not provide new results. More importantly, the proposed proof technique is not very different from earlier papers [1,2]. Although previous work focused on a single variance-reduction method, this paper’s contribution and level of innovation seem more like extending the previously introduced adapti
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Advanced Bandit Algorithms Research · Gaussian Processes and Bayesian Inference
