Theoretical analysis of Adam using hyperparameters close to one without Lipschitz smoothness

Hideaki Iiduka

arXiv:2206.13290·cs.LG·June 28, 2022

TL;DR

This paper analyzes the Adam optimizer theoretically without assuming Lipschitz smoothness. The analysis indicates that Adam performs well with small constant learning rates, hyperparameters $\beta_1$ and $\beta_2$ close to one, and large batch sizes, which is consistent with what is observed in practice.

Contribution

The paper offers the first theoretical analysis of Adam that does not assume Lipschitz smoothness, and it shows that Adam is effective when the hyperparameters are close to one and the learning rate is small.

Findings

01 Adam performs well with hyperparameters close to one.

02 Small learning rates improve Adam's performance.

03 Large batch sizes enhance Adam's effectiveness.

Abstract

Convergence and convergence rate analyses of adaptive methods, such as Adaptive Moment Estimation (Adam) and its variants, have been widely studied for nonconvex optimization. The analyses are based on assumptions that the expected or empirical average loss function is Lipschitz smooth (i.e., its gradient is Lipschitz continuous) and the learning rates depend on the Lipschitz constant of the Lipschitz continuous gradient. Meanwhile, numerical evaluations of Adam and its variants have clarified that using small constant learning rates without depending on the Lipschitz constant and hyperparameters ($\beta_1$ and $\beta_2$) close to one is advantageous for training deep neural networks. Since computing the Lipschitz constant is NP-hard, the Lipschitz smoothness condition would be unrealistic. This paper provides theoretical analyses of Adam without assuming the Lipschitz smoothness…
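
For readers who want to see where the learning rate and the hyperparameters $\beta_1$ and $\beta_2$ enter, the following is a minimal sketch of the textbook Adam update (in the form given by Kingma and Ba) with a constant learning rate. The abstract above is truncated and does not state the exact variant the paper analyzes, so this is an illustration under that assumption, not the author's algorithm. The names `adam_minimize` and `toy_grad`, the toy objective $f(x) = \sum_i x_i^4$ (whose gradient is not globally Lipschitz continuous), and all numeric settings are hypothetical choices made for this example.

```python
import math


def adam_minimize(grad, x0, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8, steps=1000):
    """Textbook Adam with a constant learning rate (illustrative sketch only).

    grad: function mapping a list of floats to the gradient (list of floats).
    beta1, beta2: decay rates of the first and second moment estimates; the
    defaults are the commonly used values close to one.
    """
    x = list(x0)
    m = [0.0] * len(x)  # first moment estimate (moving average of gradients)
    v = [0.0] * len(x)  # second moment estimate (moving average of squared gradients)
    for t in range(1, steps + 1):
        g = grad(x)
        for i in range(len(x)):
            m[i] = beta1 * m[i] + (1.0 - beta1) * g[i]
            v[i] = beta2 * v[i] + (1.0 - beta2) * g[i] ** 2
            # Bias correction compensates for initializing m and v at zero.
            m_hat = m[i] / (1.0 - beta1 ** t)
            v_hat = v[i] / (1.0 - beta2 ** t)
            x[i] -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return x


def toy_grad(x):
    # Gradient of the hypothetical objective f(x) = sum(x_i ** 4).
    # Its second derivative 12 * x_i ** 2 is unbounded, so the gradient
    # is not globally Lipschitz continuous.
    return [4.0 * xi ** 3 for xi in x]


x_start = [1.0, -2.0]
x_final = adam_minimize(toy_grad, x_start, lr=1e-2, steps=2000)
```

Here `lr` is fixed in advance and never consults a Lipschitz constant, which mirrors the practical setting the abstract describes. The sketch uses exact gradients; a mini-batch version would replace `grad` with a stochastic estimate.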

Taxonomy

Topics: Sparse and Compressive Sensing Techniques · Stochastic Gradient Optimization Techniques · Machine Learning and ELM

Methods: Adam