Adam-family Methods with Decoupled Weight Decay in Deep Learning

Kuangyu Ding; Nachuan Xiao; Kim-Chuan Toh

arXiv:2310.08858 · math.OC · October 16, 2023 · 2 cites

Open Access

TL;DR

This paper introduces a new theoretical framework for Adam-family optimization methods with decoupled weight decay, providing convergence guarantees and explaining empirical benefits in training nonsmooth neural networks.

Contribution

We propose a novel framework for Adam-family methods with decoupled weight decay, establishing convergence and unifying several existing algorithms.

Findings

01. The framework guarantees convergence under mild conditions.

02. AdamD outperforms Adam in generalization and efficiency.

03. The framework explains why decoupled weight decay improves performance.

Abstract

In this paper, we investigate the convergence properties of a wide class of Adam-family methods for minimizing quadratically regularized nonsmooth nonconvex optimization problems, especially in the context of training nonsmooth neural networks with weight decay. Motivated by the AdamW method, we propose a novel framework for Adam-family methods with decoupled weight decay. Within our framework, the estimators for the first-order and second-order moments of stochastic subgradients are updated independently of the weight decay term. Under mild assumptions and with non-diminishing stepsizes for updating the primary optimization variables, we establish the convergence properties of our proposed framework. In addition, we show that our proposed framework encompasses a wide variety of well-known Adam-family methods, hence offering convergence guarantees for these methods in the training of…
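The distinction the abstract draws, that the moment estimators are updated independently of the weight decay term, can be illustrated with a minimal sketch. This is not the paper's framework or its proposed method; it is a generic scalar Adam-style step in which the function name `adam_step`, the `decoupled` flag, and all default hyperparameters are assumptions chosen for illustration. With `decoupled=False` the decay term is folded into the gradient, as with a quadratic regularizer, and therefore enters both moment estimates; with `decoupled=True` the moments see only the raw gradient and the decay is applied directly to the weight, in the style of AdamW.

```python
import math


def adam_step(w, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8,
              weight_decay=0.0, decoupled=True):
    """One scalar Adam-style step (illustrative sketch, not the paper's method).

    w: current weight, g: stochastic (sub)gradient of the loss at w,
    m, v: first- and second-moment estimates, t: step count starting at 1.
    Returns the updated (w, m, v).
    """
    if not decoupled:
        # Coupled variant: the decay term is added to the gradient,
        # so it also enters the moment estimates below.
        g = g + weight_decay * w

    # Exponential moving averages of the gradient and its square.
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g

    # Standard Adam bias correction.
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)

    w_new = w - lr * m_hat / (math.sqrt(v_hat) + eps)

    if decoupled:
        # Decoupled variant: the decay acts on the weight directly,
        # leaving m and v untouched by the decay term.
        w_new -= lr * weight_decay * w

    return w_new, m, v


if __name__ == "__main__":
    # Toy comparison on the loss 0.5 * (w - 1)^2, whose gradient is w - 1.
    for decoupled in (False, True):
        w, m, v = 2.0, 0.0, 0.0
        for t in range(1, 201):
            w, m, v = adam_step(w, w - 1.0, m, v, t, lr=0.05,
                                weight_decay=0.1, decoupled=decoupled)
        print("decoupled" if decoupled else "coupled", w)
```

A quick way to see the difference is to pass a zero gradient: in the decoupled variant the moment estimates stay at zero and the weight simply shrinks by the factor `1 - lr * weight_decay`, whereas in the coupled variant the decay term alone drives both moments and the weight moves by an adaptively rescaled step.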

Taxonomy

Topics: Stochastic Gradient Optimization Techniques · Machine Learning and ELM · Sparse and Compressive Sensing Techniques

Methods: Weight Decay · Adam · AdamW · Stochastic Gradient Descent