On the Implicit Bias of Adam

Matias D. Cattaneo; Jason M. Klusowski; Boris Shigida

arXiv:2309.00079 · cs.LG · June 18, 2024

TL;DR

This paper uses backward error analysis to study implicit regularization in the Adam and RMSProp optimizers. It shows that, depending on the hyperparameters and the training stage, these methods either penalize the (perturbed) one-norm of the loss gradients or impede its reduction, which may affect generalization.

Contribution

The paper extends the backward error analysis previously applied to gradient descent to Adam and RMSProp, showing how their implicit biases depend on the hyperparameters and the training stage, with implications for generalization.
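For reference, the hyperparameters in question can be seen in a minimal NumPy sketch of one Adam update. This is the common textbook form with bias correction and with `eps` added outside the square root; the paper's exact formulation (for example, where the numerical-stability constant is placed) may differ, and the function name `adam_step` is only illustrative.

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update in the common textbook form (assumed, not taken from the paper).

    theta, grad, m, v are arrays of the same shape; t is the 1-based step count.
    """
    # Exponential moving averages of the gradient and of its elementwise square.
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad**2
    # Bias correction compensates for the zero initialization of m and v.
    m_hat = m / (1 - beta1**t)
    v_hat = v / (1 - beta2**t)
    # Elementwise step, normalized by the root of the second-moment estimate.
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

theta0 = np.array([1.0, -2.0])
grad0 = np.array([0.5, -3.0])
theta1, m1, v1 = adam_step(theta0, grad0, np.zeros(2), np.zeros(2), t=1)
```

On the very first step the bias-corrected ratio reduces to `grad / (|grad| + eps)`, so each coordinate moves by roughly `lr` in the direction opposite to the sign of its gradient, regardless of the gradient's magnitude.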

Findings

01

Implicit regularization in Adam and RMSProp depends on hyperparameters and training stage.

02

The corresponding correction terms either penalize the (perturbed) one-norm of the loss gradients or impede its reduction.

03

Numerical experiments illustrate the impact on generalization.
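To make the "(perturbed) one-norm" in the findings above concrete, here is a small sketch under an assumed definition: the sum over coordinates of `sqrt(g_i**2 + eps)`. This summary does not spell out the definition, so the formula and the names `perturbed_one_norm` and `perturbed_one_norm_grad` are assumptions made for illustration and should be checked against the paper.

```python
import numpy as np

def perturbed_one_norm(g, eps=1e-8):
    """Assumed definition: sum_i sqrt(g_i^2 + eps). Equals the one-norm when eps = 0."""
    return float(np.sum(np.sqrt(g**2 + eps)))

def perturbed_one_norm_grad(g, eps=1e-8):
    """Gradient of the quantity above with respect to g: a smoothed sign function."""
    return g / np.sqrt(g**2 + eps)

g = np.array([3.0, -4.0])
exact = perturbed_one_norm(g, eps=0.0)       # plain one-norm |3| + |-4|
smoothed = perturbed_one_norm(g, eps=1e-2)   # slightly larger than the one-norm
direction = perturbed_one_norm_grad(g, eps=0.0)
```

Under this definition the perturbation only smooths the kink of the absolute value at zero, and the gradient of the quantity is an elementwise smoothed sign of `g`.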

Abstract

In previous literature, backward error analysis was used to find ordinary differential equations (ODEs) approximating the gradient descent trajectory. It was found that finite step sizes implicitly regularize solutions because terms appearing in the ODEs penalize the two-norm of the loss gradients. We prove that the existence of similar implicit regularization in RMSProp and Adam depends on their hyperparameters and the training stage, but with a different "norm" involved: the corresponding ODE terms either penalize the (perturbed) one-norm of the loss gradients or, conversely, impede its reduction (the latter case being typical). We also conduct numerical experiments and discuss how the proven facts can influence generalization.
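The gradient descent result mentioned at the start of the abstract can be checked numerically on a toy problem. The sketch below assumes the modified loss from the earlier literature on implicit gradient regularization, `L_mod = L + (h/4) * ||grad L||^2` for step size `h` (this formula is not stated in the summary itself), and uses a one-dimensional quadratic `L(theta) = a * theta**2 / 2`, for which both gradient flows have closed-form solutions. All function names are illustrative.

```python
import math

def gd_step(theta, a, h):
    """One gradient descent step on L(theta) = a * theta**2 / 2."""
    return theta - h * a * theta

def plain_flow(theta, a, h):
    """Exact solution at time h of the gradient flow theta' = -L'(theta)."""
    return theta * math.exp(-a * h)

def modified_flow(theta, a, h):
    """Exact solution at time h of theta' = -L_mod'(theta), where
    L_mod = L + (h/4) * L'(theta)**2 = (a/2) * (1 + a*h/2) * theta**2."""
    return theta * math.exp(-a * (1 + a * h / 2) * h)

theta0, a, h = 1.0, 2.0, 0.05
err_plain = abs(gd_step(theta0, a, h) - plain_flow(theta0, a, h))
err_mod = abs(gd_step(theta0, a, h) - modified_flow(theta0, a, h))
print(err_plain, err_mod)
```

The discrete step is matched more closely by the flow on the modified loss than by the plain gradient flow, which is the sense in which the extra two-norm term describes the implicit bias of a finite step size.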

Peer Reviews

Decision: ICML 2024 Poster

Reviewer 01 · Rating 6 (marginally above the acceptance threshold) · Confidence 3

Strengths

- The paper provides a detailed backward error analysis for both Adam and RMSProp. The authors are able to show that Adam has bias terms that penalize the $1$-norm, $2$-norm, or $-1$-norm depending on the settings of $\beta_1$ and $\beta_2$ in Adam.
- The paper's result on the implicit bias might help explain the difference in the generalization ability of adaptive algorithms and GD algorithms.
- The numerical experiments confirm the theoretical results.
- The paper is well-written overall.

Weaknesses

- Some of the graphs are a bit confusing since the $x$ and $y$ axes are not labeled carefully. More explanation and discussion of these graphs would be appreciated.
- Some transformer tasks might help show whether the behavior of the $1$-norm is consistent across different domains. If I'm not mistaken, Adam generalizes better than SGD on transformer-related tasks, which slightly contradicts the first conclusion in the discussion section.

Reviewer 02 · Rating 6 (marginally above the acceptance threshold) · Confidence 3

Strengths

- The topic of the paper, the implicit bias of first-order algorithms, is an active field of research with many recent results. So far, characterizing the implicit bias of Adam and other preconditioned methods has not been easy.
- The paper, to my understanding, seems to present a novel result corroborated by some empirical evidence in dimension 2.

Weaknesses

- The writing of the paper seems subpar to me, and would benefit from being thoroughly proofread. In some places it sounds very informal or colloquial, e.g., "which is 'eaten' by the gradient".
- The analysis, though interesting, is also hand-wavy: see questions below.

Reviewer 03 · Rating 6 (marginally above the acceptance threshold) · Confidence 3

Strengths

* The overall contents are easily understandable.
* A timely issue, Adam's modified loss, is addressed in the paper.
* Though I did not review all the content in the supplementary material, it appears technically correct.
* The authors validated their findings using a toy problem (Section 5) and practical neural networks (Section 6).

Weaknesses

* While the overall content is easily comprehensible, specific details are inaccessible unless one reviews the supplementary material. Specifically, many theoretical works provide a sketch of the proof in the main paper to explain the techniques newly considered or developed by the authors and clarify how they differ from existing techniques. This paper lacks such details.
* In addition, the experimental setup in Section 6 is not self-contained and has elements that seem arbitrary, making cherry-p

Code & Models

Repositories

borshigida/implicit-bias-of-adam
PyTorch · Official

Taxonomy

Topics: Model Reduction and Neural Networks · Numerical methods in inverse problems · Numerical methods in engineering

Methods: Adam · RMSProp