Beyond Implicit Bias: The Insignificance of SGD Noise in Online Learning
Nikhil Vyas, Depen Morwani, Rosie Zhao, Gal Kaplun, Sham Kakade, Boaz Barak

TL;DR
This paper investigates the role of SGD noise in online (single-epoch) learning. It finds that small batch sizes confer no implicit bias benefit in this regime and offer only computational advantages, challenging the prior attribution of SGD's success to the implicit bias of finite batch sizes.
Contribution
The study provides empirical evidence that SGD noise does not confer implicit bias in online learning, contrasting with offline learning, and offers a new perspective on SGD's role.
Findings
Small batch sizes do not provide implicit bias benefits in online learning.
SGD noise mainly offers computational advantages in online regimes.
SGD in online learning acts as noisy steps along the noiseless gradient path.
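The last finding can be illustrated with a toy sketch. It is not the paper's experimental setup (which uses image and language models); it assumes a one-dimensional linear regression where every step draws fresh samples, so no example is ever reused, and compares small-batch and large-batch SGD against gradient descent on the population loss. All names (`sample_batch`, `online_sgd`, `noiseless_path`, `max_gap`) and all constants are hypothetical choices made for illustration.

```python
import random


def sample_batch(rng, batch_size, w_true=2.0, noise_std=0.5):
    """Draw a fresh batch with y = w_true * x + noise (toy data, an assumption)."""
    batch = []
    for _ in range(batch_size):
        x = rng.gauss(0.0, 1.0)
        y = w_true * x + rng.gauss(0.0, noise_std)
        batch.append((x, y))
    return batch


def online_sgd(batch_size, steps, lr, seed, w_true=2.0):
    """Online SGD on squared error: each step uses samples never seen before."""
    rng = random.Random(seed)
    w = 0.0
    path = [w]
    for _ in range(steps):
        batch = sample_batch(rng, batch_size, w_true)
        # Mini-batch gradient of (w * x - y)^2 with respect to w.
        grad = sum(2.0 * (w * x - y) * x for x, y in batch) / batch_size
        w -= lr * grad
        path.append(w)
    return path


def noiseless_path(steps, lr, w_true=2.0):
    """Gradient descent on the population loss; with x ~ N(0, 1) the
    expected gradient is 2 * (w - w_true)."""
    w = 0.0
    path = [w]
    for _ in range(steps):
        w -= lr * 2.0 * (w - w_true)
        path.append(w)
    return path


def max_gap(path_a, path_b):
    """Largest per-step distance between two parameter trajectories."""
    return max(abs(a - b) for a, b in zip(path_a, path_b))


if __name__ == "__main__":
    steps, lr = 200, 0.02
    reference = noiseless_path(steps, lr)
    for batch_size in (1, 16, 256):
        path = online_sgd(batch_size, steps, lr, seed=0)
        print(batch_size, steps * batch_size, round(max_gap(path, reference), 3))
```

Each printed row gives the batch size, the number of samples consumed, and the largest deviation from the noiseless trajectory. In this toy setting the runs share the same number of steps, so a smaller batch consumes fewer samples; whether its trajectory stays close enough to the noiseless one depends on the learning rate and noise level chosen here, and says nothing about the paper's models.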
Abstract
The success of SGD in deep learning has been ascribed by prior works to the implicit bias induced by finite batch sizes ("SGD noise"). While prior works focused on offline learning (i.e., multiple-epoch training), we study the impact of SGD noise on online (i.e., single epoch) learning. Through an extensive empirical analysis of image and language data, we demonstrate that small batch sizes do not confer any implicit bias advantages in online learning. In contrast to offline learning, the benefits of SGD noise in online learning are strictly computational, facilitating more cost-effective gradient steps. This suggests that SGD in the online regime can be construed as taking noisy steps along the "golden path" of the noiseless gradient descent algorithm. We study this hypothesis and provide supporting evidence in loss and function space. Our findings challenge the prevailing…
Peer Reviews
Decision: ICML 2024 Spotlight
1. It is very interesting that SGD noise plays a different role in the single-epoch and multiple-epoch regimes.
2. Figures are well presented and convey a succinct summary of the experimental results.
3. The expressions "Fork in the Road" and "Golden Path" are eye-catching terms that create instant curiosity.
1. The paper is mostly well written; however, the details behind the experimental results are somewhat sparse, including the appendix. Some further clarifications would strengthen the paper substantially. For example, on page 5, it is stated that "To imitate the online regime with ImageNet, we only train for 10 epochs with data augmentation." In the abstract, online learning refers to the single epoch regime but on page 5, it seems that this is not the case. Furthermore, Appendix A contains very
1. The main result in this paper that SGD follows the same path in online learning settings is an interesting finding in my opinion.
2. The experiments support the main claims well, and the claims made by the paper are stated clearly in general.
1. I would like to understand more about the scope of the main results: - The experiments are performed on Resnet18, ConvNext-T, and GPT-2 small, which are relatively large models. I'm wondering if the main hypothesis of this paper also holds for smaller models, or if this phenomenon might be due to the overparameterization of the models? - The study of this paper focuses on SGD noise, i.e. the noise comes from not using full-batch. I'm wondering if the main hypothesis also holds for manually
The problem studied in this paper is important, as LLMs might adopt the examined online method to update their parameters. The paper performs extensive experiments to support its empirical findings.
The online learning setting investigated lacks a rigorous and detailed formulation. See more details in Questions.
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Machine Learning and ELM · Neural Networks and Applications
Methods: Stochastic Gradient Descent
