Implicit Bias of Per-sample Adam on Separable Data: Departure from the Full-batch Regime

Beomhan Baek; Minhak Song; Chulhee Yun

arXiv:2510.26303·cs.LG·March 5, 2026

3 Reviews

TL;DR

This paper investigates how the implicit bias of the Adam optimizer differs between the full-batch and incremental (per-sample) regimes. It shows that incremental Adam can favor different classifiers depending on the dataset, whereas Signum keeps the same bias regardless of batch size.

Contribution

It characterizes the implicit bias of incremental Adam on linearly separable data, showing divergence from full-batch behavior and introducing a data-dependent Mahalanobis-norm margin analysis.

Findings

01

On suitably constructed datasets, incremental Adam provably converges to the $\ell_2$-max-margin classifier.

02

Full-batch Adam favors the $\ell_\infty$-max-margin classifier.

03

Signum consistently converges to the $\ell_\infty$-max-margin classifier regardless of batch size.
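
For reference, the margins named in these findings can be written out. This is the standard definition of an $\ell_p$-max-margin direction for separable data $\{(x_i, y_i)\}$ with $y_i \in \{\pm 1\}$, followed by the usual Signum update. The symbols $w$, $x_i$, $y_i$, $g_t$, $m_t$, $\eta$, and $\beta$ are notation assumed here rather than taken from this page, and the paper's own normalization may differ.

$$
w^{\star}_{p} \in \arg\max_{\|w\|_p \le 1} \; \min_{i} \; y_i \langle w, x_i \rangle, \qquad p \in \{2, \infty\}
$$

$$
m_t = \beta\, m_{t-1} + (1-\beta)\, g_t, \qquad w_{t+1} = w_t - \eta\, \mathrm{sign}(m_t)
$$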

Abstract

Adam [Kingma & Ba, 2015] is the de facto optimizer in deep learning, yet its theoretical understanding remains limited. Prior analyses show that Adam favors solutions aligned with $\ell_{\infty}$-geometry, but these results are restricted to the full-batch regime. In this work, we study the implicit bias of incremental Adam (using one sample per step) for logistic regression on linearly separable data, and show that its bias can deviate from the full-batch behavior. As an extreme example, we construct datasets on which incremental Adam provably converges to the $\ell_{2}$-max-margin classifier, in contrast to the $\ell_{\infty}$-max-margin bias of full-batch Adam. For general datasets, we characterize its bias using a proxy algorithm for the $\beta_{2} \to 1$ limit. This proxy maximizes a data-adaptive Mahalanobis-norm margin, whose associated covariance matrix is determined by a data-dependent…
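
The two regimes contrasted in the abstract can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code: the toy dataset, the hyperparameters, the cyclic sample order, the omission of bias correction, and the names `logistic_grad`, `run_adam`, and `cosine` are all assumptions made for this example. It runs Adam on the logistic loss either with the full-batch gradient or with one sample per step, then compares the two resulting directions.

```python
import numpy as np

def logistic_grad(w, X, y):
    """Gradient of the mean logistic loss log(1 + exp(-y * <w, x>)) over the rows of X."""
    margins = y * (X @ w)
    # Numerically stable 1 / (1 + exp(margins)).
    sig = np.exp(-np.logaddexp(0.0, margins))
    return -((y * sig)[:, None] * X).mean(axis=0)

def run_adam(X, y, steps, per_sample, lr=0.01, beta1=0.9, beta2=0.95, eps=1e-8):
    """Adam without bias correction (an assumption of this sketch).

    per_sample=True uses one sample per step in cyclic order;
    per_sample=False uses the full-batch gradient.
    """
    n, d = X.shape
    w, m, v = np.zeros(d), np.zeros(d), np.zeros(d)
    for t in range(steps):
        if per_sample:
            i = t % n  # cyclic pass over the data (assumed ordering)
            g = logistic_grad(w, X[i:i + 1], y[i:i + 1])
        else:
            g = logistic_grad(w, X, y)
        m = beta1 * m + (1.0 - beta1) * g        # first-moment estimate
        v = beta2 * v + (1.0 - beta2) * g * g    # second-moment estimate
        w = w - lr * m / (np.sqrt(v) + eps)
    return w

def cosine(u, v):
    """Cosine similarity between two directions."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical linearly separable toy data, not a dataset from the paper.
X = np.array([[2.0, 1.0], [1.0, 2.0], [-2.0, -1.0], [-1.0, -3.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

w_full = run_adam(X, y, steps=2000, per_sample=False)
w_inc = run_adam(X, y, steps=2000, per_sample=True)

print("full-batch direction:", w_full / np.linalg.norm(w_full))
print("per-sample direction:", w_inc / np.linalg.norm(w_inc))
print("cosine similarity:", cosine(w_full, w_inc))
```

On a dataset this small the two directions may or may not differ noticeably. Reproducing the separation described in the abstract would require the paper's own constructions and settings, which are not given on this page.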

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 8 · Confidence 4

Strengths

1. The paper is well written and, overall, easy to follow. 2. It studies an important problem (how does batch size affect the implicit bias of Adam?), characterizes the implicit bias of per-sample Adam under some conditions, and provides interesting insights into how it can differ from standard max-margin solutions.

Weaknesses

There are a few weaknesses that the paper should address: 1. The paper does not discuss related work on the implicit bias of Adam in neural networks [1-2], on the effect of momentum parameter values [3], or on rotations [4]. Another paper [5], on the effect of batch size on SGD and Adam, should also be cited in Section 7. 2. In Fig. 1 (left), the cosine similarity of the Adam iterates with the $\ell_2$-max-margin solution does not converge to 1. This is later clarified in

Reviewer 02 · Rating 6 · Confidence 4

Strengths

1. The paper explicitly points out that the previous literature considers the implicit bias of Adam only in the deterministic setting, and suggests that noise can influence Adam and Signum differently. This can broaden the study of Adam. 2. There is one rigorous theoretical example, on the Generalized Rademacher data, that explicitly shows Adam converging to the $\ell_2$-max-margin solution. 3. Many synthetic experiments show that the solution stochastic Adam converges to is closer to $\ell_2

Weaknesses

1. The theory for stochastic Adam on general data is incomplete. After Theorem 4.8, the fixed-point analysis can only be carried out by simulating Algorithm 3, which is less satisfying. In addition, the proposed AdamProxy almost reduces to AdaGrad, so it is unclear whether the analysis really applies to Adam, which runs with a fixed $\beta_2<1$. 2. Assumption 4.4 is too strong: it requires the existence of the converged direction. In comparison, Zhang et al., 2024 doesn’t need to ass

Reviewer 03 · Rating 4 · Confidence 5

Strengths

1. Since Adam is usually run with mini-batches rather than full batches in practice, studying the implicit bias of stochastic Adam is of great importance. 2. The conclusions of this paper are fairly complete and abundant, covering both stochastic Adam and Signum.

Weaknesses

1. **Strong technical assumptions**: The critical assumption that $\beta_2\to 1$ essentially makes the denominator of Adam's updates remain invariant across data points. From a theoretical perspective, this simplification directly gives stochastic Adam the same update direction as a specific preconditioned full-batch GD, which appears to be the direct reason for the derived implicit bias. The central question, however, is whether this assumption is justifiable for drawing conclusions
