A Bootstrap Perspective on Stochastic Gradient Descent

Hongjian Lan; Yucong Liu; Florian Schäfer

arXiv:2512.07676·cs.LG·December 9, 2025

Open Access · 3 Reviews

TL;DR

This paper explores how stochastic gradient descent (SGD) improves generalization in machine learning by implicitly regularizing solution variability through bootstrap-like sampling, supported by empirical and theoretical analysis.

Contribution

It introduces a bootstrap perspective on SGD, showing that it implicitly regularizes the trace of the gradient covariance matrix to enhance generalization.
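As a rough illustration of the quantity named above, the sketch below computes the trace of the per-sample gradient covariance for a squared-error linear model and adds it to the loss as a penalty. The model, the function names, and the weight `lam` are assumptions made for this example; the paper's exact regularizer and scaling may differ.

```python
import random


def per_sample_gradients(w, xs, ys):
    """Gradient of 0.5 * (w.x - y)^2 with respect to w, one per data point."""
    grads = []
    for x, y in zip(xs, ys):
        residual = sum(wj * xj for wj, xj in zip(w, x)) - y
        grads.append([residual * xj for xj in x])
    return grads


def gradient_covariance_trace(grads):
    """Trace of the (1/n-normalized) covariance of per-sample gradients.

    This equals the mean squared distance of each gradient from the mean
    gradient, so the full covariance matrix never has to be formed.
    """
    n, d = len(grads), len(grads[0])
    mean = [sum(g[j] for g in grads) / n for j in range(d)]
    return sum(sum((g[j] - mean[j]) ** 2 for j in range(d)) for g in grads) / n


def regularized_loss(w, xs, ys, lam):
    """Mean squared-error loss plus lam times the gradient covariance trace.

    `lam` is a hypothetical penalty weight chosen for illustration.
    """
    n = len(xs)
    loss = sum(
        0.5 * (sum(wj * xj for wj, xj in zip(w, x)) - y) ** 2
        for x, y in zip(xs, ys)
    ) / n
    return loss + lam * gradient_covariance_trace(per_sample_gradients(w, xs, ys))


if __name__ == "__main__":
    rng = random.Random(0)
    w_true = [1.0, -2.0]
    xs = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(200)]
    ys = [sum(a * b for a, b in zip(w_true, x)) + rng.gauss(0, 0.1) for x in xs]
    # Compare the penalty at the generating weights and at the origin.
    for w in (w_true, [0.0, 0.0]):
        trace = gradient_covariance_trace(per_sample_gradients(w, xs, ys))
        print(w, trace, regularized_loss(w, xs, ys, lam=0.1))
```

Writing the trace as a mean squared deviation keeps the cost linear in the number of parameters, which is the main reason to penalize the trace rather than the full covariance matrix in a sketch like this.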

Findings

01

SGD favors solutions robust under resampling.

02

Explicit regularization of gradient covariance improves test performance.

03

SGD controls solution sensitivity to sampling noise.

Abstract

Machine learning models trained with stochastic gradient descent (SGD) can generalize better than those trained with deterministic gradient descent (GD). In this work, we study SGD's impact on generalization through the lens of the statistical bootstrap: SGD uses gradient variability under batch sampling as a proxy for solution variability under the randomness of the data collection process. We use empirical results and theoretical analysis to substantiate this claim. In idealized experiments on empirical risk minimization, we show that SGD is drawn to parameter choices that are robust under resampling and thus avoids spurious solutions even if they lie in wider and deeper minima of the training loss. We prove rigorously that by implicitly regularizing the trace of the gradient covariance matrix, SGD controls the algorithmic variability. This regularization leads to solutions…
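For readers unfamiliar with the statistical bootstrap that the abstract refers to, here is a minimal sketch of the idea of "solution variability under resampling": refit a model on datasets drawn with replacement from the observed data and look at how much the fitted parameter moves. The one-parameter model `y = w * x`, the function names, and the synthetic data are assumptions for illustration and are not taken from the paper's experiments.

```python
import random
import statistics


def fit_slope(xs, ys):
    """Least-squares slope for the model y = w * x (no intercept)."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)


def bootstrap_slopes(xs, ys, n_resamples, rng):
    """Refit the slope on datasets resampled with replacement from (xs, ys)."""
    n = len(xs)
    slopes = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        slopes.append(fit_slope([xs[i] for i in idx], [ys[i] for i in idx]))
    return slopes


if __name__ == "__main__":
    rng = random.Random(0)
    # Synthetic data: true slope 3.0 with Gaussian noise (illustrative only).
    xs = [rng.uniform(1.0, 2.0) for _ in range(100)]
    ys = [3.0 * x + rng.gauss(0.0, 0.5) for x in xs]
    slopes = bootstrap_slopes(xs, ys, 500, rng)
    print("slope fitted on the full data:", fit_slope(xs, ys))
    print("bootstrap standard deviation of the slope:", statistics.stdev(slopes))
```

A small bootstrap standard deviation indicates a solution that is robust under resampling; the abstract's claim is that SGD's batch-to-batch gradient variability acts as a proxy for this kind of variability.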

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 2 · Confidence 4

Strengths

This paper is clearly written and has a nice structure.

Weaknesses

Although the authors provide an upper bound on the generalization error via algorithmic stability, the paper does not explicitly establish how SGD regularizes this term theoretically. Moreover, there is no theoretical characterization of the generalization gap between SGD and GD. Another concern arises from the assumptions: while Assumption 1 appears standard, Assumption 2 is rather demanding and may not hold in many scenarios: existing theoretical results generally suggest that the upper bound

Reviewer 02 · Rating 2 · Confidence 2

Strengths

1. The top example in Section 2 is attractive and illustrative.

Weaknesses

1. The presentation of the theoretical part is a bit confusing.
   - The theoretical results are listed as Lemmas 1 and 2 and Proposition 1, without a theorem to serve as the center of the discussion. This leaves me unsure what the main theoretical contribution of the paper is.
   - The discussions after Lemmas 1 and 2 mainly explain why the lemmas hold and do not actually help with understanding the theoretical results (especially for Lemma 2, whose right-hand side has a

Reviewer 03 · Rating 4 · Confidence 3

Strengths

The question raised in the paper is important, and the paper tests a new regularization method based on its analyses and shows that it might benefit generalization.

Weaknesses

The theoretical contribution appears to be incremental, as, to my understanding, the main insights came from Smith et al. (2021). The empirical evaluation is very limited, as the results are tested only on a very specific synthetic dataset with a sparse prior and FashionMNIST.

Taxonomy

Topics: Stochastic Gradient Optimization Techniques · Gaussian Processes and Bayesian Inference · Generative Adversarial Networks and Image Synthesis