Towards Theoretically Understanding Why SGD Generalizes Better Than ADAM in Deep Learning
Pan Zhou, Jiashi Feng, Chao Ma, Caiming Xiong, Steven Hoi, Weinan E

TL;DR
This paper provides a theoretical analysis explaining why SGD generalizes better than ADAM by examining their local convergence behaviors and the impact of gradient noise tails on escaping sharp minima.
Contribution
It introduces a Levy-driven SDE framework for analyzing escaping times from local basins, showing why SGD escapes sharp minima more readily than ADAM.
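For orientation, a Levy-driven SDE and its escaping time from a basin are commonly written in the generic form below. This is a sketch under assumed notation: the symbols $\theta_t$, $F$, $\varepsilon$, $L_t^{\alpha}$, $\Omega$ and $\Gamma$ are chosen here for illustration and are not necessarily those used in the paper, and the ADAM variant (with momentum and coordinate-wise scaling) is not shown.

```latex
% Generic Levy-driven SDE for an SGD-like iterate (assumed notation):
%   theta_t        : parameters at time t
%   F              : training loss
%   varepsilon     : noise magnitude
%   L_t^{alpha}    : alpha-stable Levy motion; smaller alpha means heavier tails
\[
  d\theta_t = -\nabla F(\theta_t)\,dt + \varepsilon\, dL_t^{\alpha},
  \qquad 0 < \alpha \le 2 .
\]
% Escaping time from a local basin Omega around a minimum:
\[
  \Gamma = \inf\{\, t \ge 0 : \theta_t \notin \Omega \,\}.
\]
```

With $\alpha = 2$ the driving process reduces to Brownian motion, which is the usual Gaussian-noise setting.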
Findings
SGD has smaller escaping times from sharp minima than ADAM.
For both SGD and ADAM, heavier-tailed gradient noise shortens the escaping time from a basin, while a larger Radon measure of the basin lengthens it.
Experimental results support the heavy-tailed noise assumption and theoretical insights.
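The effect of tail heaviness on escaping time can be illustrated with a toy simulation. The sketch below is not the paper's experiment: it assumes a one-dimensional quadratic basin, an Euler discretization of a generic Levy-driven SDE, and symmetric alpha-stable noise drawn with the Chambers-Mallows-Stuck method. All function names and parameter values (`curvature`, `radius`, `eta`, `eps`, `max_steps`) are hypothetical choices for illustration, and escape times are capped at `max_steps`.

```python
import math
import random


def symmetric_stable(alpha, rng):
    """Draw one symmetric alpha-stable sample (Chambers-Mallows-Stuck method).

    alpha = 2 gives a Gaussian with variance 2; smaller alpha gives heavier tails.
    """
    u = rng.uniform(-math.pi / 2, math.pi / 2)
    w = rng.expovariate(1.0)
    return (math.sin(alpha * u) / math.cos(u) ** (1.0 / alpha)
            * (math.cos((1.0 - alpha) * u) / w) ** ((1.0 - alpha) / alpha))


def escape_time(alpha, rng, curvature=1.0, radius=1.0,
                eta=0.01, eps=0.1, max_steps=5000):
    """Steps until |x| leaves the basin [-radius, radius] of f(x) = curvature * x^2 / 2.

    Returns max_steps if no escape happens within the step budget (censored).
    """
    x = 0.0
    # Euler scheme for a Levy-driven SDE: the noise increment scales as eta^(1/alpha).
    noise_scale = eps * eta ** (1.0 / alpha)
    for step in range(1, max_steps + 1):
        x += -eta * curvature * x + noise_scale * symmetric_stable(alpha, rng)
        if abs(x) > radius:
            return step
    return max_steps


def mean_escape_time(alpha, trials=10, seed=0):
    """Average (censored) escape time over independent trials."""
    rng = random.Random(seed)
    return sum(escape_time(alpha, rng) for _ in range(trials)) / trials


if __name__ == "__main__":
    for alpha in (2.0, 1.6, 1.2):
        print(f"alpha={alpha}: mean escape steps = {mean_escape_time(alpha):.0f}")
```

Under these assumed settings, the Gaussian case (`alpha=2.0`) typically stays inside the basin for the whole step budget, while heavier-tailed noise tends to leave through a single large jump. This matches the qualitative direction of the finding only; it says nothing about the SGD versus ADAM comparison, which depends on ADAM's adaptation and momentum.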
Abstract
It is not yet clear why ADAM-like adaptive gradient algorithms generalize worse than SGD despite their faster training speed. This work aims to explain this generalization gap by analyzing their local convergence behaviors. Specifically, we observe heavy tails in the gradient noise of these algorithms. This motivates us to analyze these algorithms through their Levy-driven stochastic differential equations (SDEs), because an algorithm and its SDE have similar convergence behaviors. We then establish the escaping time of these SDEs from a local basin. The result shows that (1) the escaping time of both SGD and ADAM depends on the Radon measure of the basin positively and the heaviness of gradient noise negatively; (2) for the same basin, SGD enjoys smaller escaping time than ADAM, mainly because (a) the geometry adaptation in ADAM via…
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Model Reduction and Neural Networks · Neural Networks and Applications
Methods: Stochastic Gradient Descent
