Why gradient clipping accelerates training: A theoretical justification for adaptivity

Jingzhao Zhang; Tianxing He; Suvrit Sra; Ali Jadbabaie

arXiv:1905.11881·math.OC·February 12, 2020·85 cites

TL;DR

This paper offers a theoretical explanation for why gradient clipping speeds up the training of deep neural networks. It observes that gradient smoothness varies along the training trajectory and grows with the gradient norm, and it introduces a new, weaker smoothness condition to capture this behavior.

Contribution

The paper introduces a relaxation of the standard gradient smoothness assumption and proves that, under this relaxed condition, gradient clipping and normalized gradient methods can converge faster than gradient descent with a fixed step size.
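For orientation, the relaxed condition can be written next to the classical one. The block below is a sketch based on the paper's definition of (L_0, L_1)-smoothness for a twice-differentiable objective f; the symbols L, L_0 and L_1 do not appear on this page and are introduced here only for illustration.

```latex
% Classical L-smoothness: the Hessian norm is bounded by one constant.
\[
  \|\nabla^2 f(x)\| \le L \quad \text{for all } x .
\]
% Relaxed (L_0, L_1)-smoothness: the bound may grow linearly
% with the gradient norm.
\[
  \|\nabla^2 f(x)\| \le L_0 + L_1 \,\|\nabla f(x)\| \quad \text{for all } x .
\]
```

Setting L_1 = 0 recovers the classical condition, so every L-smooth function also satisfies the relaxed one. The converse fails: f(x) = x^4, for example, satisfies the relaxed bound for suitable constants but admits no single constant L.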

Findings

01. Gradient smoothness varies significantly during training.

02. Gradient smoothness correlates positively with gradient norm.

03. Gradient clipping accelerates convergence in practice.
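The effect in the last finding can be illustrated on a toy problem. The following is a minimal sketch, not the authors' experimental setup: the objective f(x) = x**4, the starting point, the step size and the clipping threshold are all assumptions chosen for illustration. The curvature of this objective grows with the gradient, so a fixed step size that is safe near the minimum is far too large away from it.

```python
import math


def grad(x):
    # Gradient of the toy objective f(x) = x**4 (an illustrative assumption).
    return 4.0 * x ** 3


def plain_gd(x0, lr, steps):
    # Gradient descent with a fixed step size.
    x = x0
    for _ in range(steps):
        x -= lr * grad(x)
        if abs(x) > 1e6:
            # Treat the run as diverged and stop early.
            return math.inf
    return x


def clipped_gd(x0, lr, clip, steps):
    # Gradient descent with clipping by norm: whenever |g| exceeds `clip`,
    # the gradient is rescaled so that its norm equals `clip`.
    x = x0
    for _ in range(steps):
        g = grad(x)
        scale = min(1.0, clip / abs(g)) if g != 0.0 else 1.0
        x -= lr * scale * g
    return x


if __name__ == "__main__":
    print("plain GD:  ", plain_gd(10.0, 0.1, 500))
    print("clipped GD:", clipped_gd(10.0, 0.1, 1.0, 500))
```

With these hypothetical settings the unclipped run blows up within a few iterations, while the clipped run moves toward the minimum at a bounded rate and then behaves like ordinary gradient descent once the gradient falls below the threshold. In a deep learning framework the same idea is normally applied to the full parameter gradient between the backward pass and the optimizer step.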

Abstract

We provide a theoretical explanation for the effectiveness of gradient clipping in training deep neural networks. The key ingredient is a new smoothness condition derived from practical neural network training examples. We observe that gradient smoothness, a concept central to the analysis of first-order optimization algorithms that is often assumed to be a constant, demonstrates significant variability along the training trajectory of deep neural networks. Further, this smoothness positively correlates with the gradient norm, and contrary to standard assumptions in the literature, it can grow with the norm of the gradient. These empirical observations limit the applicability of existing theoretical analyses of algorithms that rely on a fixed bound on smoothness. These observations motivate us to introduce a novel relaxation of gradient smoothness that is weaker than the commonly used…

Code & Models

Repositories

JingzhaoZhang/why-clipping-accelerates (PyTorch, official)

Taxonomy

Topics: Stochastic Gradient Optimization Techniques · Sparse and Compressive Sensing Techniques · Domain Adaptation and Few-Shot Learning

Methods: Gradient Clipping