Theoretical analysis of Adam using hyperparameters close to one without Lipschitz smoothness
Hideaki Iiduka

TL;DR
This paper provides a theoretical analysis of the Adam optimizer without assuming Lipschitz smoothness. It shows that Adam performs well with small learning rates, hyperparameters near one, and large batch sizes, which aligns the theory with practical observations.
Contribution
It offers the first theoretical analysis of Adam without Lipschitz smoothness assumptions, demonstrating effectiveness with hyperparameters close to one and small learning rates.
Findings
Adam performs well with hyperparameters close to one.
Small learning rates improve Adam's performance.
Large batch sizes enhance Adam's effectiveness.
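The regime these findings describe can be sketched with a small experiment. The code below is a minimal, illustrative implementation of the standard Adam update with bias correction, which is an assumption here because this summary does not state the exact variant the paper analyzes. The helper names (`adam_minimize`, `make_minibatch_grad`), the toy quadratic loss, and every numeric setting are hypothetical choices for illustration, not taken from the paper. Note that the toy loss is itself Lipschitz smooth, so the sketch only illustrates the hyperparameter regime (small constant learning rate, beta1 and beta2 close to one, a fairly large batch), not the paper's non-smooth setting.

```python
import math
import random


def adam_minimize(grad_fn, x0, lr=1e-3, beta1=0.9, beta2=0.999,
                  eps=1e-8, steps=5000):
    """Standard Adam with bias correction on a list of floats (assumed variant)."""
    x = list(x0)
    m = [0.0] * len(x)  # first-moment (mean of gradients) estimate
    v = [0.0] * len(x)  # second-moment (mean of squared gradients) estimate
    for t in range(1, steps + 1):
        g = grad_fn(x)
        for i in range(len(x)):
            m[i] = beta1 * m[i] + (1.0 - beta1) * g[i]
            v[i] = beta2 * v[i] + (1.0 - beta2) * g[i] * g[i]
            # Bias correction compensates for the zero initialization of m and v.
            m_hat = m[i] / (1.0 - beta1 ** t)
            v_hat = v[i] / (1.0 - beta2 ** t)
            x[i] -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return x


def make_minibatch_grad(targets, batch_size, rng):
    """Stochastic gradient of the toy loss f(x) = mean_j 0.5 * ||x - a_j||^2."""
    def grad_fn(x):
        batch = rng.sample(targets, batch_size)
        return [sum(x[i] - a[i] for a in batch) / batch_size
                for i in range(len(x))]
    return grad_fn


# Hypothetical toy data: 256 two-dimensional points; the minimizer is their mean.
rng = random.Random(0)
targets = [[rng.gauss(1.0, 0.5), rng.gauss(-2.0, 0.5)] for _ in range(256)]
mean = [sum(a[i] for a in targets) / len(targets) for i in range(2)]

# Small constant learning rate, betas close to one, and a relatively large batch.
grad_fn = make_minibatch_grad(targets, batch_size=64, rng=rng)
x_final = adam_minimize(grad_fn, x0=[0.0, 0.0], lr=1e-2,
                        beta1=0.9, beta2=0.999, steps=3000)
dist = math.sqrt(sum((x_final[i] - mean[i]) ** 2 for i in range(2)))
print("distance to minimizer:", dist)
```

Changing `batch_size`, `lr`, `beta1`, or `beta2` and comparing `dist` across runs gives a rough feel for how each setting affects the noise floor on this toy problem; it says nothing about the paper's theoretical bounds.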
Abstract
Convergence and convergence rate analyses of adaptive methods, such as Adaptive Moment Estimation (Adam) and its variants, have been widely studied for nonconvex optimization. The analyses are based on assumptions that the expected or empirical average loss function is Lipschitz smooth (i.e., its gradient is Lipschitz continuous) and the learning rates depend on the Lipschitz constant of the Lipschitz continuous gradient. Meanwhile, numerical evaluations of Adam and its variants have clarified that using small constant learning rates without depending on the Lipschitz constant and hyperparameters (β1 and β2) close to one is advantageous for training deep neural networks. Since computing the Lipschitz constant is NP-hard, the Lipschitz smoothness condition would be unrealistic. This paper provides theoretical analyses of Adam without assuming the Lipschitz smoothness…
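For reference, the Lipschitz smoothness condition that the abstract describes is usually written as below. This is the standard textbook definition, not notation quoted from the paper; the symbols f, L, and d are assumptions chosen here.

```latex
% Standard definition of Lipschitz smoothness (L-smoothness).
% The symbols f, L, and d are assumed for illustration, not taken from the paper.
\[
  \|\nabla f(x) - \nabla f(y)\| \;\le\; L \,\|x - y\|
  \quad \text{for all } x, y \in \mathbb{R}^d .
\]
```

Analyses built on this condition typically tie the learning rate to L (a common textbook condition for plain gradient descent is a step size of at most 1/L), which is why a constant L that is hard to compute makes such learning rates impractical to set.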
Taxonomy
Topics: Sparse and Compressive Sensing Techniques · Stochastic Gradient Optimization Techniques · Machine Learning and ELM
Methods: Adam
