Understanding AdamW through Proximal Methods and Scale-Freeness

Zhenxun Zhuang; Mingrui Liu; Ashok Cutkosky; Francesco Orabona

arXiv:2202.00089 · cs.LG · February 2, 2022 · 36 cites

TL;DR

This paper explains the advantages of AdamW over Adam-$\ell_2$ by interpreting AdamW as an approximation of a proximal gradient method and by examining its scale-free property, with supporting empirical evidence across a range of deep learning tasks.

Contribution

It interprets AdamW as an approximation of a proximal gradient method and links its scale-freeness to its performance advantage over Adam-$\ell_2$ in deep learning.

Findings

01. AdamW can be viewed as an approximation of a proximal gradient method.

02. AdamW's scale-freeness correlates with its performance advantages.

03. Empirical results across multiple tasks support the hypothesis that scale-freeness is beneficial.
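The scale-freeness in finding 02 can be illustrated on a toy problem. The following is a minimal sketch, not the authors' code: the function name `run`, the one-dimensional quadratic loss, and all hyperparameter values are assumptions chosen for illustration, and $\epsilon$ is set to zero because the invariance is only expected to hold exactly in that case. It rescales the loss (and hence its gradient) by a constant and compares where a decoupled, AdamW-style update and an Adam-$\ell_2$ update end up.

```python
import math

def run(decoupled, scale, wd=1.0, lr=0.01, beta1=0.9, beta2=0.999, steps=500):
    """Minimize scale * 0.5 * (x - 1)^2 in one dimension, with eps = 0.

    decoupled=True  : AdamW-style step, weight decay added outside the adaptive ratio.
    decoupled=False : Adam-l2 step, the regularizer gradient wd * x is folded into g.
    (Hypothetical helper for illustration only.)
    """
    x, m, v = 0.0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = scale * (x - 1.0)              # gradient of the rescaled loss
        if not decoupled:
            g += wd * x                    # Adam-l2: regularizer enters the moments
        m = beta1 * m + (1 - beta1) * g    # first-moment estimate
        v = beta2 * v + (1 - beta2) * g * g  # second-moment estimate
        m_hat = m / (1 - beta1 ** t)       # bias corrections
        v_hat = v / (1 - beta2 ** t)
        step = m_hat / math.sqrt(v_hat)    # adaptive direction (eps = 0)
        if decoupled:
            step += wd * x                 # AdamW: decay applied to the weights directly
        x -= lr * step
    return x

# A power-of-two scale keeps the rescaling exact in floating point.
adamw_gap = abs(run(True, 1.0) - run(True, 1024.0))
adam_l2_gap = abs(run(False, 1.0) - run(False, 1024.0))
print("AdamW final-iterate gap across scales:  ", adamw_gap)
print("Adam-l2 final-iterate gap across scales:", adam_l2_gap)
```

In this sketch the decoupled update follows the same trajectory at both scales, because the rescaling cancels in the ratio of the moment estimates, while the Adam-$\ell_2$ update does not, because the term `wd * x` is not rescaled along with the loss gradient.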

Abstract

Adam has been widely adopted for training deep neural networks due to less hyperparameter tuning and remarkable performance. To improve generalization, Adam is typically used in tandem with a squared $\ell_2$ regularizer (referred to as Adam-$\ell_2$). However, even better performance can be obtained with AdamW, which decouples the gradient of the regularizer from the update rule of Adam-$\ell_2$. Yet, we are still lacking a complete explanation of the advantages of AdamW. In this paper, we tackle this question from both an optimization and an empirical point of view. First, we show how to re-interpret AdamW as an approximation of a proximal gradient method, which takes advantage of the closed-form proximal mapping of the regularizer instead of only utilizing its gradient information as in Adam-$\ell_2$. Next, we consider the property of "scale-freeness" enjoyed by AdamW and by its…
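The proximal reading described in the abstract can be sketched as a short derivation. The notation here is assumed for illustration and is not taken from the text above: $\eta_t$ is the step size, $\lambda$ the regularization weight, and $p_t$ stands for Adam's adaptive direction computed from the loss gradient alone. The first line is the closed-form proximal step for the squared $\ell_2$ regularizer; the second is its first-order expansion in $\eta_t$, which has the form of a decoupled weight-decay update.

```latex
% Proximal step on (lambda/2)||x||_2^2 applied after an adaptive step along p_t
x_{t+1} = \arg\min_{x} \; \frac{\lambda}{2}\|x\|_2^2
          + \frac{1}{2\eta_t}\bigl\|x - (x_t - \eta_t p_t)\bigr\|_2^2
        = \frac{x_t - \eta_t p_t}{1 + \lambda \eta_t}

% First-order expansion: (1 + \lambda\eta_t)^{-1} = 1 - \lambda\eta_t + O(\eta_t^2)
x_{t+1} \approx (1 - \lambda \eta_t)\, x_t - \eta_t\, p_t
```

By contrast, under the same assumed notation, Adam-$\ell_2$ would compute its adaptive direction from $g_t + \lambda x_t$, so the regularizer enters only through its gradient.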

Code & Models

Repositories

zhenxun-zhuang/adamw-scale-free (PyTorch)

Taxonomy

Topics: Neural Networks and Applications · Domain Adaptation and Few-Shot Learning · Stochastic Gradient Optimization Techniques

Methods: Adam · AdamW