A Bootstrap Perspective on Stochastic Gradient Descent
Hongjian Lan, Yucong Liu, Florian Schäfer

TL;DR
This paper explores how stochastic gradient descent (SGD) improves generalization in machine learning by implicitly regularizing solution variability through bootstrap-like sampling, supported by empirical and theoretical analysis.
Contribution
It introduces a bootstrap perspective on SGD, showing that it implicitly regularizes the trace of the gradient covariance matrix to enhance generalization.
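For readers unfamiliar with the quantity named here, the following is a sketch of one standard definition of the gradient covariance trace. The notation ($\ell_i$, $L$, $\Sigma$, $g_B$, $b$) and the $1/n$ normalization are assumptions chosen for illustration and may differ from the paper's own conventions.

```latex
% Assumed notation: \ell_i is the loss on training sample i, n the sample count.
\[
L(\theta) = \frac{1}{n}\sum_{i=1}^{n} \ell_i(\theta),
\qquad
\Sigma(\theta) = \frac{1}{n}\sum_{i=1}^{n}
  \bigl(\nabla \ell_i(\theta) - \nabla L(\theta)\bigr)
  \bigl(\nabla \ell_i(\theta) - \nabla L(\theta)\bigr)^{\top}
\]
\[
\operatorname{tr}\,\Sigma(\theta)
  = \frac{1}{n}\sum_{i=1}^{n}
    \bigl\lVert \nabla \ell_i(\theta) - \nabla L(\theta) \bigr\rVert^{2}
\]
% For a minibatch gradient g_B averaged over b samples drawn uniformly
% with replacement from the training set:
\[
\mathbb{E}\,\bigl\lVert g_B(\theta) - \nabla L(\theta) \bigr\rVert^{2}
  = \frac{\operatorname{tr}\,\Sigma(\theta)}{b}
\]
```

Under this definition, the trace measures how much a minibatch gradient fluctuates around the full-batch gradient, which is the "gradient variability under batch sampling" that the abstract refers to.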
Findings
SGD favors solutions robust under resampling.
Explicit regularization of gradient covariance improves test performance.
SGD controls solution sensitivity to sampling noise.
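As an illustration of the second finding, here is a minimal sketch of what an explicit gradient-covariance penalty could look like for least-squares regression. The penalty form `loss + lam * trace`, the function names, and the synthetic data are all assumptions made for this example; the source does not specify the paper's actual regularizer or experimental setup.

```python
import numpy as np

def per_sample_grads(theta, X, y):
    """Gradients of the per-sample loss 0.5 * (x_i @ theta - y_i)**2, shape (n, d)."""
    residuals = X @ theta - y
    return residuals[:, None] * X

def grad_cov_trace(theta, X, y):
    """Trace of the empirical (1/n-normalized) covariance of per-sample gradients."""
    g = per_sample_grads(theta, X, y)
    centered = g - g.mean(axis=0)
    return float(np.mean(np.sum(centered ** 2, axis=1)))

def regularized_loss(theta, X, y, lam):
    """Training loss plus lam times the gradient-covariance trace (hypothetical penalty)."""
    loss = 0.5 * np.mean((X @ theta - y) ** 2)
    return float(loss + lam * grad_cov_trace(theta, X, y))

# Synthetic data, purely illustrative.
rng = np.random.default_rng(0)
n, d = 200, 5
X = rng.normal(size=(n, d))
theta_true = rng.normal(size=d)
y = X @ theta_true + 0.1 * rng.normal(size=n)
theta = np.zeros(d)

trace_direct = grad_cov_trace(theta, X, y)
# Cross-check against the trace of the full d x d covariance matrix.
g = per_sample_grads(theta, X, y)
trace_via_matrix = float(np.trace(np.cov(g, rowvar=False, bias=True)))

print("trace (direct):", trace_direct)
print("trace (via covariance matrix):", trace_via_matrix)
print("penalized loss:", regularized_loss(theta, X, y, lam=0.1))
```

The direct computation avoids forming the full covariance matrix, which matters when the parameter dimension is large. The penalty vanishes wherever every per-sample gradient equals the mean gradient, for example at a parameter that fits noiseless data exactly.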
Abstract
Machine learning models trained with \emph{stochastic} gradient descent (SGD) can generalize better than those trained with deterministic gradient descent (GD). In this work, we study SGD's impact on generalization through the lens of the statistical bootstrap: SGD uses gradient variability under batch sampling as a proxy for solution variability under the randomness of the data collection process. We use empirical results and theoretical analysis to substantiate this claim. In idealized experiments on empirical risk minimization, we show that SGD is drawn to parameter choices that are robust under resampling and thus avoids spurious solutions even if they lie in wider and deeper minima of the training loss. We prove rigorously that by implicitly regularizing the trace of the gradient covariance matrix, SGD controls the algorithmic variability. This regularization leads to solutions…
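The "solution variability under the randomness of the data collection process" mentioned in the abstract can be estimated with the classical statistical bootstrap. The sketch below assumes a least-squares model and synthetic data, neither of which comes from the paper; the function name `bootstrap_solution_variance` is hypothetical. It resamples the training set with replacement, refits each time, and reports the total variance of the fitted parameters.

```python
import numpy as np

def bootstrap_solution_variance(X, y, n_boot=200, seed=0):
    """Total variance of least-squares solutions across bootstrap resamples."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    solutions = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)  # sample n rows with replacement
        theta_hat, *_ = np.linalg.lstsq(X[idx], y[idx], rcond=None)
        solutions.append(theta_hat)
    solutions = np.asarray(solutions)
    # Sum of per-coordinate variances = trace of the solution covariance.
    return float(np.sum(np.var(solutions, axis=0)))

# Synthetic data, purely illustrative.
rng = np.random.default_rng(1)
n, d = 100, 3
X = rng.normal(size=(n, d))
theta_true = np.array([1.0, -2.0, 0.5])
y_clean = X @ theta_true
y_noisy = y_clean + 0.5 * rng.normal(size=n)

var_clean = bootstrap_solution_variance(X, y_clean)
var_noisy = bootstrap_solution_variance(X, y_noisy)
print("solution variance, noiseless labels:", var_clean)
print("solution variance, noisy labels:", var_noisy)
```

With noiseless labels every resample recovers essentially the same parameters, so the variance is near zero; with noisy labels the fitted solution moves from resample to resample. Refitting many times is expensive for large models, which is why a cheaper proxy based on gradient variability is attractive.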
Peer Reviews
Decision·Submitted to ICLR 2026
This paper is clearly written and has a nice structure.
Although the authors provide an upper bound on the generalization error via algorithmic stability, the paper does not explicitly establish how SGD regularizes this term theoretically. Moreover, there is no theoretical characterization of the generalization gap between SGD and GD. Another concern arises from the assumptions: while Assumption 1 appears standard, Assumption 2 is rather demanding and may not hold in many scenarios: existing theoretical results generally suggest that the upper bound
1. The top example in Section 2 is attractive and illustrative.
1. The presentation of the theoretical part is a bit confusing.
   - The theoretical results are listed as Lemmas 1 and 2 and Proposition 1, without a theorem to serve as the center of the discussion. This leaves it unclear what the main theoretical contribution of the paper is.
   - The discussions after Lemmas 1 and 2 mainly explain why the lemmas hold and do not actually help with understanding the theoretical results (especially for Lemma 2, whose right-hand side has a
The question raised in the paper is important, and the paper tests a new regularization method based on the analysis and shows that it might benefit generalization.
The theoretical contribution appears to be incremental, as, to my understanding, the main insights came from Smith et al. (2021). The empirical evaluation is very limited, as the results are tested only on a very specific synthetic dataset with a sparse prior and FashionMNIST.
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Gaussian Processes and Bayesian Inference · Generative Adversarial Networks and Image Synthesis
