Eliminating Sharp Minima from SGD with Truncated Heavy-tailed Noise

Xingyu Wang; Sewoong Oh; Chang-Han Rhee

arXiv:2102.04297·cs.LG·May 12, 2022

TL;DR

This paper shows that SGD with truncated heavy-tailed gradient noise can effectively eliminate sharp minima from its training trajectory, steering training toward flatter minima and better generalization. The claim is supported by theoretical analysis and empirical validation.

Contribution

It analyzes a popular variant of SGD, in which gradients are truncated above a fixed threshold, under heavy-tailed noise, and shows that this variant can avoid sharp minima entirely, with theoretical characterization and practical evidence.

Findings

01

The truncation threshold, together with the width of a minimum's attraction region, governs the escape time from that minimum.

02

The dynamics of truncated heavy-tailed SGD resemble those of a Markov chain that avoids sharp minima.

03

Empirical results show improved generalization with gradient clipping.

Abstract

The empirical success of deep learning is often attributed to SGD's mysterious ability to avoid sharp local minima in the loss landscape, as sharp minima are known to lead to poor generalization. Recently, empirical evidence of heavy-tailed gradient noise was reported in many deep learning tasks, and it was shown in Şimşekli (2019a,b) that SGD can escape sharp local minima under the presence of such heavy-tailed gradient noise, providing a partial solution to the mystery. In this work, we analyze a popular variant of SGD where gradients are truncated above a fixed threshold. We show that it achieves a stronger notion of avoiding sharp minima: it can effectively eliminate sharp local minima entirely from its training trajectory. We characterize the dynamics of truncated SGD driven by heavy-tailed noises. First, we show that the truncation threshold and width of the attraction…
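The truncated update described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the symmetric Pareto noise, the choice to clip the whole step (learning rate times noisy gradient), and the toy double-well objective are all assumptions made here for illustration.

```python
import random


def truncate(w, b):
    """Clip w to the interval [-b, b], preserving its sign."""
    return max(-b, min(b, w))


def heavy_tailed_noise(rng, alpha):
    """Symmetric Pareto noise with tail index alpha (an assumed noise model)."""
    return rng.choice((-1.0, 1.0)) * rng.paretovariate(alpha)


def truncated_sgd(grad, x0, lr, b, alpha, steps, seed=0):
    """Run 1-D SGD where each step is truncated at threshold b.

    Assumed update: x <- x - truncate(lr * (grad(x) + noise), b).
    """
    rng = random.Random(seed)
    x = x0
    path = [x]
    for _ in range(steps):
        noisy_grad = grad(x) + heavy_tailed_noise(rng, alpha)
        x = x - truncate(lr * noisy_grad, b)
        path.append(x)
    return path


def toy_grad(x):
    """Gradient of the hypothetical objective f(x) = (x^2 - 1)^2."""
    return 4.0 * x * (x * x - 1.0)


# Hypothetical parameter values, chosen only to make the sketch run.
path = truncated_sgd(toy_grad, x0=1.0, lr=0.01, b=0.5, alpha=1.5, steps=1000)
```

Because every step is clipped to magnitude at most `b`, a single large noise sample cannot move the iterate farther than `b`, so leaving a wide basin takes several large jumps in succession. The sketch only demonstrates the update rule and makes no attempt to reproduce the paper's results.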

Videos

Eliminating Sharp Minima from SGD with Truncated Heavy-tailed Noise · SlidesLive

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning

Methods: Gradient Clipping · Stochastic Gradient Descent