Adam with model exponential moving average is effective for nonconvex optimization

Kwangjun Ahn; Ashok Cutkosky

arXiv:2405.18199 · cs.LG · October 31, 2024 · 2 cites

TL;DR

This paper gives a theoretical analysis showing that a clipped version of Adam, combined with a model exponential moving average (EMA), achieves optimal convergence rates in nonconvex optimization, in both smooth and nonsmooth settings.

Contribution

It offers the first theoretical proof that clipped Adam with model EMA is effective, establishing optimal convergence rates in nonconvex optimization.

Findings

01. Clipped Adam with model EMA achieves optimal convergence rates.

02. Coordinate-wise adaptivity of Adam is provably beneficial.

03. Analysis emphasizes the importance of momentum and discounting in Adam.
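To make the coordinate-wise adaptivity in the findings concrete, here is a small sketch that compares two ways of normalizing a single gradient whose coordinates differ greatly in scale. The helper names `coordinatewise_step` and `global_step` are hypothetical, and the single-step setting with no momentum is a simplification for illustration, not the paper's argument.

```python
import math

def coordinatewise_step(grads, eps=1e-8):
    # Assumed simplification: each coordinate is divided by its own magnitude,
    # which is what Adam's first step reduces to (up to eps).
    return [g / (abs(g) + eps) for g in grads]

def global_step(grads, eps=1e-8):
    # Baseline for comparison: one shared scale (the Euclidean norm) for all coordinates.
    norm = math.sqrt(sum(g * g for g in grads))
    return [g / (norm + eps) for g in grads]

# Toy gradient with very different scales across coordinates (illustrative values).
g = [100.0, 0.01]
cw = coordinatewise_step(g)  # both coordinates move by roughly the same amount
gl = global_step(g)          # the small-scale coordinate barely moves
```

With per-coordinate scaling both entries of `cw` are close to 1, whereas the shared scale in `gl` leaves the second coordinate with a step of about 1e-4, so progress along it is much slower.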

Abstract

In this work, we offer a theoretical analysis of two modern optimization techniques for training large and complex models: (i) adaptive optimization algorithms, such as Adam, and (ii) the model exponential moving average (EMA). Specifically, we demonstrate that a clipped version of Adam with model EMA achieves the optimal convergence rates in various nonconvex optimization settings, both smooth and nonsmooth. Moreover, when the scale varies significantly across different coordinates, we demonstrate that the coordinate-wise adaptivity of Adam is provably advantageous. Notably, unlike previous analyses of Adam, our analysis crucially relies on its core elements -- momentum and discounting factors -- as well as model EMA, motivating their wide applications in practice.
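A minimal sketch of how the pieces named in the abstract can fit together, assuming the standard Adam update with bias correction, a per-coordinate clip on the normalized step, and an EMA of the iterates. The function name `clipped_adam_with_ema`, the clipping rule, the hyperparameter values, and the toy objective are all illustrative assumptions; the paper's exact algorithm and settings may differ. The toy objective is a simple quadratic, used only to show the mechanics.

```python
import math

def clipped_adam_with_ema(grad_fn, x0, steps=500, lr=0.05, beta1=0.9,
                          beta2=0.999, eps=1e-8, clip=1.0, ema_decay=0.99):
    """Run Adam with a clipped per-coordinate step and track an EMA of the iterates."""
    x = list(x0)
    x_ema = list(x0)        # EMA of the model weights (the "model EMA")
    m = [0.0] * len(x)      # first-moment estimate (momentum)
    v = [0.0] * len(x)      # second-moment estimate (discounted squared gradients)
    for t in range(1, steps + 1):
        g = grad_fn(x)
        for i in range(len(x)):
            m[i] = beta1 * m[i] + (1 - beta1) * g[i]
            v[i] = beta2 * v[i] + (1 - beta2) * g[i] ** 2
            m_hat = m[i] / (1 - beta1 ** t)      # bias-corrected momentum
            v_hat = v[i] / (1 - beta2 ** t)      # bias-corrected second moment
            step = m_hat / (math.sqrt(v_hat) + eps)
            step = max(-clip, min(clip, step))   # assumed clipping rule
            x[i] -= lr * step
            # Model EMA: average the iterates, separately from the optimizer state.
            x_ema[i] = ema_decay * x_ema[i] + (1 - ema_decay) * x[i]
    return x, x_ema

def grad(x):
    # Gradient of the toy objective f(x) = 0.5 * (100 * x[0]**2 + 0.01 * x[1]**2).
    return [100.0 * x[0], 0.01 * x[1]]

x_last, x_avg = clipped_adam_with_ema(grad, [1.0, 1.0])
```

Here `x_last` is the final iterate and `x_avg` is the EMA of the iterates. The averaged weights are kept alongside the optimizer and never fed back into the update, which is how model EMA is commonly used in practice.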

Videos

Adam with model exponential moving average is effective for nonconvex optimization (SlidesLive)

Taxonomy

Topics: Advanced Optimization Algorithms Research

Methods: Adam