To Clip or not to Clip: the Dynamics of SGD with Gradient Clipping in High-Dimensions

Noah Marshall; Ke Liang Xiao; Atish Agarwala; Elliot Paquette

arXiv:2406.11733·stat.ML·October 8, 2024

TL;DR

This paper provides a theoretical analysis of gradient clipping in high-dimensional stochastic gradient descent, revealing its effects on learning dynamics and practical benefits in noisy settings, supported by experiments.

Contribution

It introduces a deterministic equation modeling clipped SGD in high dimensions and proposes a heuristic for optimal clipping threshold scheduling.

Findings

01 Clipping does not improve performance with Gaussian noise.

02 In other noisy environments, clipping can be beneficial with proper tuning.

03 The proposed heuristic simplifies hyperparameter tuning for clipping thresholds.

Abstract

The success of modern machine learning is due in part to the adaptive optimization methods that have been developed to deal with the difficulties of training large models over complex datasets. One such method is gradient clipping: a practical procedure with limited theoretical underpinnings. In this work, we study clipping in a least squares problem under streaming SGD. We develop a theoretical analysis of the learning dynamics in the limit of large intrinsic dimension, a model- and dataset-dependent notion of dimensionality. In this limit we find a deterministic equation that describes the evolution of the loss and demonstrate that this equation predicts the path of clipped SGD on synthetic, CIFAR10, and Wikitext2 data. We show that with Gaussian noise clipping cannot improve SGD performance. Yet, in other noisy settings, clipping can provide benefits with tuning of the clipping…
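To make the setting in the abstract concrete, here is a minimal sketch of clipped streaming SGD on a least squares problem. It is an illustration under stated assumptions, not the authors' implementation: isotropic Gaussian inputs, Gaussian label noise, norm clipping of the per-sample gradient, a step size scaled by 1/d, and all names (`clip_by_norm`, `clipped_streaming_sgd`, `theta_star`) are hypothetical choices made for this example.

```python
import numpy as np


def clip_by_norm(g, c):
    """Rescale g so that its Euclidean norm is at most c (norm clipping)."""
    norm = np.linalg.norm(g)
    return g if norm <= c else g * (c / norm)


def clipped_streaming_sgd(d=200, steps=2000, lr=0.5, c=5.0, noise_std=0.1, seed=0):
    """Run one pass of clipped SGD where every step sees a fresh sample.

    Assumptions (not taken from the paper): x ~ N(0, I_d), Gaussian label
    noise, and an effective step size of lr / d.
    """
    rng = np.random.default_rng(seed)
    theta_star = rng.standard_normal(d) / np.sqrt(d)  # hypothetical ground truth
    theta = np.zeros(d)
    risks = np.empty(steps)
    for t in range(steps):
        x = rng.standard_normal(d)  # fresh sample each step (streaming)
        y = x @ theta_star + noise_std * rng.standard_normal()
        grad = (x @ theta - y) * x  # gradient of 0.5 * (x @ theta - y) ** 2
        theta = theta - (lr / d) * clip_by_norm(grad, c)
        # For isotropic x the excess risk is 0.5 * ||theta - theta_star||^2.
        risks[t] = 0.5 * np.sum((theta - theta_star) ** 2)
    return risks


if __name__ == "__main__":
    risks = clipped_streaming_sgd()
    print(f"excess risk: first step {risks[0]:.4f}, last step {risks[-1]:.4f}")
```

Varying `c` and swapping the Gaussian label noise for another distribution gives a simple way to explore the clipped versus unclipped comparison the abstract describes; any such experiment inherits the assumptions listed above.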

Code & Models

Repositories

nmarzz/clip (PyTorch, official)

Taxonomy

Topics: Advanced Thermodynamics and Statistical Mechanics

Methods: Stochastic Gradient Descent