Is your batch size the problem? Revisiting the Adam-SGD gap in language modeling
Teodora Srećković, Jonas Geiping, Antonio Orvieto

TL;DR
This paper investigates the optimizer gap between Adam and SGD in language modeling, revealing that with proper tuning, SGD with momentum can match Adam's performance in small-batch scenarios, and provides new insights into training dynamics.
Contribution
It demonstrates that SGD with momentum can perform comparably to Adam in small-batch settings when properly tuned, challenging previous explanations for Adam's superiority.
Findings
SGD with momentum matches Adam in small-batch training when tuned
Existing explanations like class imbalance and sharpness do not fully explain the gap
Batch size significantly influences training dynamics as shown by SDE models
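One way to see why batch size matters for the training dynamics: the variance of a mini-batch gradient estimate shrinks roughly in proportion to 1/B, so a small batch gives a much noisier update direction. The sketch below illustrates this. It assumes i.i.d. Gaussian per-example gradient noise, which is a simplification for illustration and not necessarily the noise model used in the paper's SDE analysis. The function name and parameters are hypothetical.

```python
import numpy as np

def minibatch_grad_variance(batch_size, n_trials=20000, sigma=1.0, seed=0):
    """Empirical variance of a scalar mini-batch gradient estimate.

    Assumption (not from the paper): every per-example gradient equals the
    true gradient (taken to be 0 here) plus i.i.d. Gaussian noise with
    standard deviation `sigma`.
    """
    rng = np.random.default_rng(seed)
    # One row per simulated mini-batch, one column per example in the batch.
    per_example = rng.normal(0.0, sigma, size=(n_trials, batch_size))
    # The mini-batch gradient is the mean of the per-example gradients.
    minibatch_grads = per_example.mean(axis=1)
    return minibatch_grads.var()

for b in (1, 4, 16, 64):
    var = minibatch_grad_variance(b)
    # Under the i.i.d. assumption, var * b should stay close to sigma**2.
    print(f"batch size {b:3d}: variance {var:.4f}, variance * B {var * b:.3f}")
```

Under this assumption the product of variance and batch size stays near `sigma**2`. Real language-model gradient noise may be heavy-tailed or correlated, in which case the scaling can differ.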
Abstract
Adam is known to perform significantly better than Stochastic Gradient Descent (SGD) in language models, a phenomenon for which a number of explanations have been proposed. In this work, we revisit this "optimizer gap" through a series of comprehensively tuned baseline training runs for language modeling with Transformers. We exhaustively study how momentum, gradient clipping, and batch size affect the gap between SGD and Adam. Our empirical findings show that SGD with momentum can actually perform similarly to Adam in small-batch settings, if tuned correctly. We revisit existing explanations for Adam's advantage, including heavy-tailed class imbalance, directional sharpness, and Hessian heterogeneity, which struggle to directly explain this phenomenon. Towards bridging this gap in our understanding, by analyzing our Transformer training runs and simple quadratic settings inspired by…
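For reference, the two update rules being compared can be sketched in a few lines of NumPy. This is a minimal illustration of the textbook forms: SGD with heavy-ball momentum (in the convention PyTorch uses, without dampening) plus optional gradient-norm clipping, and Adam with bias correction. The helper names, the `state` dictionary, and all hyperparameter defaults are assumptions chosen for illustration. They are not the paper's code or its tuned values.

```python
import numpy as np

def clip_by_norm(grad, max_norm):
    """Rescale grad so that its L2 norm is at most max_norm."""
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        return grad * (max_norm / norm)
    return grad

def sgd_momentum_step(param, grad, state, lr=0.1, momentum=0.9, max_norm=None):
    """One SGD-with-momentum step; clipping is applied only if max_norm is set."""
    if max_norm is not None:
        grad = clip_by_norm(grad, max_norm)
    # Momentum buffer: running sum of past gradients, decayed by `momentum`.
    buf = momentum * state.get("buf", np.zeros_like(param)) + grad
    state["buf"] = buf
    return param - lr * buf

def adam_step(param, grad, state, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step with bias-corrected first and second moment estimates."""
    t = state.get("t", 0) + 1
    m = beta1 * state.get("m", np.zeros_like(param)) + (1 - beta1) * grad
    v = beta2 * state.get("v", np.zeros_like(param)) + (1 - beta2) * grad**2
    state.update(t=t, m=m, v=v)
    m_hat = m / (1 - beta1**t)  # bias correction for the first moment
    v_hat = v / (1 - beta2**t)  # bias correction for the second moment
    # Per-coordinate scaling by sqrt(v_hat) is what distinguishes Adam from SGD.
    return param - lr * m_hat / (np.sqrt(v_hat) + eps)

# Usage: a single step of each optimizer on the same parameters and gradient.
w = np.array([1.0, -2.0])
g = np.array([0.5, -3.0])
print(sgd_momentum_step(w, g, {}, max_norm=1.0))
print(adam_step(w, g, {}))
```

Each call takes a mutable `state` dictionary so that the momentum buffer and moment estimates persist across steps. The three knobs the abstract names (momentum, gradient clipping, and batch size) enter here through `momentum`, `max_norm`, and however `grad` is averaged over the mini-batch.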
Peer Reviews
Decision·Submitted to ICLR 2026
The paper tries to understand the gap between Adam and SGD based on the observation that the gap diminishes at small batch sizes.
The observation that the gap between Adam and SGD should diminish at small batch sizes is already supported by various previous theoretical works [1,2]. These papers show that the benefits of preconditioning decrease at small batch sizes, even for quadratics. Thus, the observation is expected. Secondly, the first half of the paper provides misleading results, as the $\beta_2$ of Adam is not properly tuned for small batch sizes. In my opinion, this is a significant misrepresentation for the first half of the pap
1. The paper presents an interesting observation about training dynamics in the small-batch regime that is not easily explained by some existing models. In particular, any model that suggests Adam is better than SGD needs to account for why that property does not hold at small batch sizes. 2. The experiments seem to be rigorous, with appropriate tuning and ablations.
1. The SDE explanation is not fully fleshed out. First, in the figure, it is the noise level, not the batch size, that is varied (I understand that they are connected, but if so it should be possible to run the experiment with batch size directly). Moreover, it is not clear that the gradient noise model is a good one in the full language modeling case. If this is the main explanation provided by the paper, there should be clearer experiments trying to substantiate the model on real data. For exam
1. The authors tried small-batch experiments in various settings to test whether previous theoretical explanations hold. Most experiments are conducted in a scientific and fair way, with ablation studies. 2. The result that SGD won’t break and can even match the performance of Adam with many more iterations is somewhat novel. I am not aware of other papers that actually run SGD for so many iterations.
1. The main finding lacks novelty. Kunstner et al. (2023) already show in their Figures 3 and 4 that the gap between Adam and SGD decreases or even disappears at small batch sizes, and this paper itself notes the consistency and similarity with previous results. 2. The setting of small batch size and more update iterations is practically irrelevant, and this paper also admits that this setting is inconvenient for training LLMs. Therefore, this paper doesn’t help to answer this important question that w
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling
Methods: Layer Normalization · Dropout · Absolute Position Encodings · Dense Connections · Byte Pair Encoding · Softmax · Label Smoothing · Transformer · Stochastic Gradient Descent
