TL;DR
This paper investigates how the implicit bias of the Adam optimizer varies between full-batch and incremental (per-sample) regimes, revealing that incremental Adam can favor different classifiers depending on the dataset, unlike Signum.
Contribution
It characterizes the implicit bias of incremental Adam on linearly separable data, showing divergence from full-batch behavior and introducing a data-dependent Mahalanobis-norm margin analysis.
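For reference, a generic Mahalanobis-norm margin can be written as below. This is a sketch in assumed notation: the samples $(x_i, y_i)$, the direction $w$, and the symmetric positive definite matrix $M$ are placeholders, and the specific data-dependent matrix used in the paper is not reproduced here.

$$
\|w\|_M = \sqrt{w^\top M w}, \qquad
\gamma_M(w) = \min_{i} \frac{y_i \langle x_i, w \rangle}{\|w\|_M}.
$$

With $M = I$ this reduces to the usual $\ell_2$ margin.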
Findings
Incremental Adam can converge to the $\ell_2$-max-margin classifier.
Full-batch Adam favors the $\ell_\infty$-max-margin classifier.
Signum consistently converges to the $\ell_\infty$-max-margin classifier regardless of batch size.
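The two classifiers named in these findings can be stated with the standard $\ell_p$-max-margin definition. The notation here ($x_i$, $y_i$, $\hat{w}_p$) is assumed for illustration and may differ from the paper's.

$$
\hat{w}_p \in \arg\max_{\|w\|_p \le 1} \; \min_{i} \; y_i \langle x_i, w \rangle, \qquad p \in \{2, \infty\}.
$$

The two solutions differ only in the norm ball that constrains $w$, which is why the same separable dataset can have different $\ell_2$- and $\ell_\infty$-max-margin directions.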
Abstract
Adam [Kingma & Ba, 2015] is the de facto optimizer in deep learning, yet its theoretical understanding remains limited. Prior analyses show that Adam favors solutions aligned with $\ell_\infty$-geometry, but these results are restricted to the full-batch regime. In this work, we study the implicit bias of incremental Adam (using one sample per step) for logistic regression on linearly separable data, and show that its bias can deviate from the full-batch behavior. As an extreme example, we construct datasets on which incremental Adam provably converges to the $\ell_2$-max-margin classifier, in contrast to the $\ell_\infty$-max-margin bias of full-batch Adam. For general datasets, we characterize its bias using a proxy algorithm for the limit. This proxy maximizes a data-adaptive Mahalanobis-norm margin, whose associated covariance matrix is determined by a data-dependent…
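The two regimes compared in the abstract can be sketched as follows. This is a minimal illustration, not the paper's experimental code: the toy dataset, the hyperparameters, the cyclic sample order, and the helper names `logistic_grad` and `adam_direction` are all assumptions made for this example, and the update follows the textbook Adam rule with bias correction.

```python
import numpy as np

# Hypothetical toy data: linearly separable, labels in {-1, +1}.
X = np.array([[2.0, 0.5], [0.5, 1.0], [-1.0, -2.0], [-1.5, -0.4]])
y = np.array([1.0, 1.0, -1.0, -1.0])


def logistic_grad(w, X, y):
    """Gradient of the mean logistic loss log(1 + exp(-y * <x, w>))."""
    margins = y * (X @ w)
    # sigmoid(-margin), computed stably as exp(-log(1 + exp(margin)))
    coeff = -y * np.exp(-np.logaddexp(0.0, margins))
    return (coeff[:, None] * X).mean(axis=0)


def adam_direction(X, y, steps, lr=0.01, beta1=0.9, beta2=0.999,
                   eps=1e-12, incremental=False):
    """Run Adam and return the normalized final iterate.

    incremental=True uses one sample per step, visited cyclically
    (an assumed ordering); otherwise the full batch is used.
    """
    n, d = X.shape
    w, m, v = np.zeros(d), np.zeros(d), np.zeros(d)
    for t in range(steps):
        if incremental:
            i = t % n
            g = logistic_grad(w, X[i:i + 1], y[i:i + 1])
        else:
            g = logistic_grad(w, X, y)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** (t + 1))  # bias-corrected first moment
        v_hat = v / (1 - beta2 ** (t + 1))  # bias-corrected second moment
        w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w / np.linalg.norm(w)


w_full = adam_direction(X, y, steps=500)
w_inc = adam_direction(X, y, steps=500, incremental=True)
print("full-batch direction: ", w_full)
print("incremental direction:", w_inc)
```

Comparing the two printed directions against the $\ell_2$- and $\ell_\infty$-max-margin directions of a chosen dataset (for example by cosine similarity) is one way to probe the bias empirically. A short run on a toy dataset like this one only illustrates the mechanics and should not be read as reproducing the paper's results.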
Peer Reviews
Decision·ICLR 2026 Poster
1. The paper is well-written and easy to follow overall. 2. It studies an important problem (what is the effect of batch size on the implicit bias of Adam?), characterizes the implicit bias of per-sample Adam under some conditions, and provides interesting insights into how it can differ from standard max-margin solutions.
There are a few weaknesses that the paper should address: 1. The paper is missing discussion on some related works that study the implicit bias of Adam in neural networks [1-2], and effect of momentum parameter values [3] and rotations [4]. There is also another paper [5] on effect of batch sizes on SGD and Adam that should be cited in Section 7. 2. In Fig. 1 (left), the cosine similarity of Adam iterates with the $\ell_2$-max-margin solution does not converge to 1. This is later clarified in
1. The paper explicitly points out that previous literature only considers the implicit bias of Adam in the deterministic setting, and suggests that noise can influence Adam and Signum differently. This can expand the study of Adam. 2. There is one rigorous theoretical example on the Generalized Rademacher data that explicitly shows that Adam converges to the $\ell_2$-max-margin solution. 3. There are many synthetic experiments showing that the solution stochastic Adam converges to is closer to $\ell_2
1. The theory for stochastic Adam is not complete on general data. After Theorem 4.8, the fixed-point analysis can only be done by simulating Algorithm 3, which is less satisfying. Also, the proposed AdamProxy almost reduces to AdaGrad, so it is unclear whether the analysis really applies to Adam, which works with a fixed $\beta_2<1$. 2. Assumption 4.4 is too strong: it requires the existence of the converged direction. In comparison, Zhang et al., 2024 doesn’t need to ass
1. Considering that the real practice of Adam is usually mini-batch instead of full-batch, studying the implicit bias of stochastic Adam is of great importance. 2. The conclusions of this paper are fairly complete and abundant, covering the stochastic Adam and Signum.
1. **Strong technical assumptions**: The critical assumption that $\beta_2\to 1$ essentially makes the denominator of Adam's updates invariant across data points. From a theoretical perspective, this simplification directly implies that stochastic Adam has the same update direction as a specific preconditioned full-batch GD, which appears to be the direct reason for the derived implicit bias. The central question, however, is whether this assumption is justifiable for drawing conclusions
