The Effect of Mini-Batch Noise on the Implicit Bias of Adam
Matias D. Cattaneo, Boris Shigida

TL;DR
This paper presents a theoretical framework for analyzing how mini-batch noise affects the Adam optimizer's implicit bias towards sharper or flatter minima, which in turn affects generalization in multi-epoch training.
Contribution
It reveals how the momentum hyperparameters β₁ and β₂ influence the implicit bias depending on batch size, guiding better hyperparameter choices for different batch regimes.
Findings
Higher β₂ increases anti-regularization at large batch sizes, hurting generalization.
At smaller batch sizes, the dependence of anti-regularization on β₂ reverses.
The default β₁ = 0.9, β₂ = 0.999 is optimal for small batches; larger batches benefit from β₁ closer to β₂.
Abstract
With limited high-quality data and growing compute, multi-epoch training is gaining back its importance across sub-areas of deep learning. Adam(W), versions of which are go-to optimizers for many tasks such as next token prediction, has two momentum hyperparameters (β₁, β₂) controlling memory and one very important hyperparameter, batch size, controlling (in particular) the amount of mini-batch noise. We introduce a theoretical framework to understand how mini-batch noise influences the implicit bias of memory in Adam (depending on β₁, β₂) towards sharper or flatter regions of the loss landscape, which is commonly observed to correlate with the generalization gap in multi-epoch training. We find that in the case of large batch sizes, higher β₂ increases the magnitude of anti-regularization by memory (hurting generalization), but as the batch size becomes…
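To make concrete where the two momentum hyperparameters and the batch size enter, here is a minimal NumPy sketch of the standard Adam update driven by a mini-batch gradient. It shows only the textbook update rule, not the paper's theoretical framework. The function names `adam_step` and `minibatch_grad`, and the least-squares toy loss, are illustrative assumptions and do not come from the paper.

```python
import numpy as np


def minibatch_grad(w, X, y, batch_size, rng):
    """Gradient of 0.5 * mean((X w - y)^2) on a random mini-batch.

    Smaller batch_size gives a noisier estimate of the full gradient.
    (Toy least-squares loss, assumed here for illustration only.)
    """
    idx = rng.choice(len(X), size=batch_size, replace=False)
    Xb, yb = X[idx], y[idx]
    return Xb.T @ (Xb @ w - yb) / batch_size


def adam_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One standard Adam update; t is the 1-based step counter."""
    # beta1 controls the memory of the first-moment (gradient) average.
    m = beta1 * m + (1.0 - beta1) * grad
    # beta2 controls the memory of the second-moment (squared gradient) average.
    v = beta2 * v + (1.0 - beta2) * grad**2
    # Bias correction for the zero-initialized averages.
    m_hat = m / (1.0 - beta1**t)
    v_hat = v / (1.0 - beta2**t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(256, 4))
    y = X @ np.array([1.0, -2.0, 0.5, 3.0])

    w = np.zeros(4)
    m = np.zeros(4)
    v = np.zeros(4)
    for t in range(1, 501):
        g = minibatch_grad(w, X, y, batch_size=32, rng=rng)
        w, m, v = adam_step(w, g, m, v, t, lr=1e-2)
    print(w)
```

Varying `batch_size`, `beta1`, and `beta2` in the loop above is the kind of knob-turning the findings refer to; this toy problem is not expected to reproduce the paper's generalization results.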
