Towards Theoretically Understanding Why SGD Generalizes Better Than ADAM in Deep Learning

Pan Zhou; Jiashi Feng; Chao Ma; Caiming Xiong; Steven Hoi; Weinan E

arXiv:2010.05627 · cs.LG · November 30, 2021 · 57 cites

TL;DR

This paper provides a theoretical analysis explaining why SGD generalizes better than ADAM by examining their local convergence behaviors and the impact of gradient noise tails on escaping sharp minima.

Contribution

It introduces a Lévy-driven SDE framework for analyzing how long each algorithm takes to escape a local basin, showing that SGD escapes sharp minima faster than ADAM.
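
For orientation, a Lévy-driven SDE for a gradient method can be sketched in the generic form below. This is an illustrative form only, not necessarily the paper's exact formulation: the symbols $\theta_t$ (parameters), $F$ (training loss), $\varepsilon$ (noise scale), and $L_t^{\alpha}$ (an $\alpha$-stable Lévy motion) are assumptions introduced here, and the paper's equations for ADAM would additionally involve momentum and coordinate-wise scaling terms.

```latex
d\theta_t = -\nabla F(\theta_t)\,dt + \varepsilon\,dL_t^{\alpha}, \qquad \alpha \in (0, 2]
```

In this form, $\alpha = 2$ corresponds to Brownian (Gaussian) noise, and smaller $\alpha$ means heavier tails, so the process makes occasional large jumps rather than only small diffusive steps.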

Findings

1. SGD has smaller escaping times from sharp minima than ADAM.

2. Heavy-tailed gradient noise affects both algorithms' escape behavior: the heavier the tails, the shorter the escaping time from a basin.

3. Experimental results support the heavy-tailed noise assumption and the theoretical insights.
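
The effect of tail heaviness on escaping time can be illustrated with a toy simulation. The sketch below is not the paper's experiment: it assumes a one-dimensional quadratic basin, a hypothetical noise scale `eps`, and symmetric alpha-stable noise drawn with the Chambers-Mallows-Stuck method, then uses an Euler discretization to compare heavy-tailed noise (`alpha = 1.2`) with Gaussian noise (`alpha = 2.0`). All function and parameter names are illustrative.

```python
import math
import random


def symmetric_alpha_stable(alpha, rng):
    """Draw one symmetric alpha-stable sample (Chambers-Mallows-Stuck method).

    alpha = 2 gives a Gaussian with variance 2; smaller alpha gives heavier tails.
    """
    u = rng.uniform(-math.pi / 2, math.pi / 2)
    w = rng.expovariate(1.0)
    return (math.sin(alpha * u) / math.cos(u) ** (1 / alpha)
            * (math.cos(u - alpha * u) / w) ** ((1 - alpha) / alpha))


def escape_time(alpha, rng, eps=0.1, radius=1.0, dt=0.01, max_steps=5000):
    """Simulate dx = -x dt + eps dL^alpha from x = 0 (assumed toy basin U(x) = x^2 / 2).

    Returns the first time |x| exceeds `radius`, capped at max_steps * dt.
    """
    x = 0.0
    for step in range(1, max_steps + 1):
        # Alpha-stable increments scale as dt^(1/alpha) rather than sqrt(dt).
        noise = eps * dt ** (1 / alpha) * symmetric_alpha_stable(alpha, rng)
        x += -x * dt + noise
        if abs(x) > radius:
            return step * dt
    return max_steps * dt


def mean_escape_time(alpha, trials=20, seed=0):
    """Average the (capped) escape time over independent runs."""
    rng = random.Random(seed)
    return sum(escape_time(alpha, rng) for _ in range(trials)) / trials


if __name__ == "__main__":
    print("heavy-tailed (alpha=1.2):", mean_escape_time(1.2))
    print("Gaussian     (alpha=2.0):", mean_escape_time(2.0))
```

Under these assumed settings the heavy-tailed runs typically leave the basin through a single large jump, while the Gaussian runs usually reach the time cap without escaping. This matches the qualitative direction of the finding, not its quantitative form.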

Abstract

It is not clear yet why ADAM-alike adaptive gradient algorithms suffer from worse generalization performance than SGD despite their faster training speed. This work aims to provide understandings on this generalization gap by analyzing their local convergence behaviors. Specifically, we observe the heavy tails of gradient noise in these algorithms. This motivates us to analyze these algorithms through their Lévy-driven stochastic differential equations (SDEs) because of the similar convergence behaviors of an algorithm and its SDE. Then we establish the escaping time of these SDEs from a local basin. The result shows that (1) the escaping time of both SGD and ADAM depends on the Radon measure of the basin positively and the heaviness of gradient noise negatively; (2) for the same basin, SGD enjoys smaller escaping time than ADAM, mainly because (a) the geometry adaptation in ADAM via…

Videos

Towards Theoretically Understanding Why SGD Generalizes Better Than ADAM in Deep Learning · SlidesLive

Taxonomy

Topics: Stochastic Gradient Optimization Techniques · Model Reduction and Neural Networks · Neural Networks and Applications

Methods: Stochastic Gradient Descent