On the SDEs and Scaling Rules for Adaptive Gradient Algorithms

Sadhika Malladi; Kaifeng Lyu; Abhishek Panigrahi; Sanjeev Arora

arXiv:2205.10287·cs.LG·November 4, 2024·1 cites

TL;DR

This paper derives stochastic differential equation approximations for adaptive gradient methods like RMSprop and Adam, providing theoretical guarantees and practical scaling rules for hyperparameters in large-scale deep learning.

Contribution

It introduces rigorous SDE approximations for RMSprop and Adam, enabling better theoretical understanding and practical hyperparameter scaling in deep learning.

Findings

01. SDE approximations for RMSprop and Adam are validated theoretically and experimentally.

02. A square root scaling rule is proposed and validated for adjusting the hyperparameters of RMSprop and Adam when the batch size changes.

03. The results improve the understanding and tuning of adaptive optimizers in large-scale settings.

Abstract

Approximating Stochastic Gradient Descent (SGD) as a Stochastic Differential Equation (SDE) has allowed researchers to enjoy the benefits of studying a continuous optimization trajectory while carefully preserving the stochasticity of SGD. Analogous study of adaptive gradient methods, such as RMSprop and Adam, has been challenging because there were no rigorously proven SDE approximations for these methods. This paper derives the SDE approximations for RMSprop and Adam, giving theoretical guarantees of their correctness as well as experimental validation of their applicability to common large-scale vision and language settings. A key practical result is the derivation of a square root scaling rule to adjust the optimization hyperparameters of RMSprop and Adam when changing batch size, and its empirical validation in deep learning settings.
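The abstract names the square root scaling rule but does not spell it out. The sketch below illustrates one way such a rule can be applied to Adam, assuming the form in which a batch size multiplied by a factor kappa calls for multiplying the learning rate by sqrt(kappa), multiplying 1 - beta1 and 1 - beta2 by kappa, and dividing epsilon by sqrt(kappa). That form, along with the names `AdamHParams` and `scale_adam_hparams`, is an assumption made for illustration and is not taken from this page or from the authors' repository.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class AdamHParams:
    """Hypothetical container for the Adam hyperparameters being rescaled."""
    lr: float
    beta1: float
    beta2: float
    eps: float


def scale_adam_hparams(h: AdamHParams, kappa: float) -> AdamHParams:
    """Rescale Adam hyperparameters when the batch size is multiplied by kappa.

    Assumed form of the square root scaling rule (not stated on this page):
      lr -> lr * sqrt(kappa)
      1 - beta -> kappa * (1 - beta), for both beta1 and beta2
      eps -> eps / sqrt(kappa)
    """
    if kappa <= 0:
        raise ValueError("kappa must be positive")
    beta1 = 1.0 - kappa * (1.0 - h.beta1)
    beta2 = 1.0 - kappa * (1.0 - h.beta2)
    # A large kappa can push a beta below zero; treat that as out of range.
    if not (0.0 <= beta1 < 1.0 and 0.0 <= beta2 < 1.0):
        raise ValueError("kappa is too large for the given betas")
    root = math.sqrt(kappa)
    return AdamHParams(lr=h.lr * root, beta1=beta1, beta2=beta2, eps=h.eps / root)


# Example: hyperparameters tuned at some base batch size, then batch size x4.
base = AdamHParams(lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8)
scaled = scale_adam_hparams(base, kappa=4.0)
print(scaled)
```

Under this assumed form, the rule only applies while the rescaled betas stay in [0, 1), which is why the sketch rejects scaling factors that are too large for the given betas rather than clamping them silently.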

Code & Models

Repositories

abhishekpanigrahi1996/Adaptive-SDE (PyTorch, official)

Videos

On the SDEs and Scaling Rules for Adaptive Gradient Algorithms · SlidesLive

Taxonomy

Topics: Stochastic Gradient Optimization Techniques · Neural Networks and Applications · Domain Adaptation and Few-Shot Learning

Methods: RMSProp · Stochastic Gradient Descent · Adam