Stein-Rule Shrinkage for Stochastic Gradient Estimation in High Dimensions
M. Arashi, M. Amintoosi

TL;DR
This paper introduces a Stein-rule shrinkage framework for stochastic gradient estimation in high-dimensional deep learning, leading to an improved optimizer called SR-Adam that outperforms standard Adam in noisy, large-batch settings.
Contribution
The paper develops a high-dimensional shrinkage estimator for stochastic gradients and integrates it into Adam. The estimator is shown to be minimax-optimal under a Gaussian noise model, and practical improvements are demonstrated on image classification tasks.
Findings
SR-Adam outperforms Adam in large-batch regimes.
Applying shrinkage to convolutional layers yields most of the gains.
The method is minimax-optimal under Gaussian noise assumptions.
Abstract
Stochastic gradient methods are central to large-scale learning, but they treat mini-batch gradients as unbiased estimators, which classical decision theory shows are inadmissible in high dimensions. We formulate gradient computation as a high-dimensional estimation problem and introduce a framework based on Stein-rule shrinkage. We construct a gradient estimator that adaptively contracts noisy mini-batch gradients toward a stable estimator derived from historical momentum. The shrinkage intensity is determined in a data-driven manner using an online estimate of gradient noise variance, leveraging statistics from adaptive optimizers. Under a Gaussian noise model, we show our estimator uniformly dominates the standard stochastic gradient under squared error loss and is minimax-optimal. We incorporate this into the Adam optimizer, yielding SR-Adam, a practical algorithm with negligible…
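The shrinkage step described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's algorithm: it assumes the classical positive-part James-Stein form, with the momentum vector as the shrinkage target and a scalar noise variance. The function names `stein_shrink` and `estimate_noise_var`, and the variance proxy built from Adam-style first and second moments, are hypothetical choices made for this example.

```python
import numpy as np


def estimate_noise_var(m, v):
    """Hypothetical scalar noise proxy from Adam-style moments.

    Assumption for illustration: per-coordinate variance is approximated by
    max(v - m^2, 0), then averaged into a single scalar.
    """
    m = np.asarray(m, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(np.mean(np.maximum(v - m ** 2, 0.0)))


def stein_shrink(grad, target, noise_var):
    """Positive-part James-Stein shrinkage of `grad` toward `target`.

    Computes target + max(0, 1 - (p - 2) * noise_var / ||grad - target||^2)
    * (grad - target), where p is the number of coordinates.
    """
    grad = np.asarray(grad, dtype=float)
    target = np.asarray(target, dtype=float)
    p = grad.size
    diff = grad - target
    sq_dist = float(np.sum(diff ** 2))
    # The classical dominance result needs p >= 3; also avoid division by zero.
    if p < 3 or sq_dist == 0.0:
        return grad.copy()
    factor = max(0.0, 1.0 - (p - 2) * noise_var / sq_dist)
    return target + factor * diff


# Example: a 4-dimensional gradient shrunk toward a zero momentum vector.
g = np.array([1.0, 1.0, 1.0, 1.0])
m = np.zeros(4)
shrunk = stein_shrink(g, m, noise_var=1.0)  # factor = 1 - 2 * 1 / 4 = 0.5
```

When the noise estimate is small relative to the distance between the gradient and the target, the factor stays close to 1 and the raw gradient is nearly unchanged. When the noise dominates, the factor clips to 0 and the estimate falls back to the target.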
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Markov Chains and Monte Carlo Methods · Gaussian Processes and Bayesian Inference
