Tight Long-Term Tail Decay of (Clipped) SGD in Non-Convex Optimization
Aleksandar Armacki, Dragana Bajović, Dušan Jakovetić, Soummya Kar, Ali H. Sayed

TL;DR
This paper investigates the long-term tail decay behavior of SGD and clipped SGD in non-convex optimization, establishing faster decay rates and tight bounds that improve understanding of algorithm reliability over extensive training periods.
Contribution
It introduces new theoretical bounds on the tail decay rates of SGD and c-SGD, revealing faster decay regimes and providing tight bounds that surpass previous finite-time results.
Findings
SGD tail decay rate is at most e^{-t/\log(t)}.
c-SGD with heavy-tailed noise has decay rate e^{-t^{\beta_p}/\log(t)}.
The rates are tight up to poly-logarithmic factors and faster than previous finite-time bounds.
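To make the quantity behind these rates concrete, the following is a minimal sketch, not the paper's setup, that estimates by simulation the probability that the best iterate's squared gradient norm still exceeds a fixed threshold after t steps of clipped SGD. The toy cost f(x) = x^2 + 3 sin^2(x), the symmetric Pareto noise, the step size 0.1/sqrt(t), the clipping threshold and every function name are assumptions chosen only for illustration.

```python
import math
import random


def grad(x):
    # Gradient of the toy non-convex cost f(x) = x^2 + 3*sin(x)^2 (assumed example).
    return 2.0 * x + 3.0 * math.sin(2.0 * x)


def clip(g, tau):
    # Scalar clipping: keep g if |g| <= tau, otherwise rescale it to magnitude tau.
    return g if abs(g) <= tau else tau * g / abs(g)


def heavy_tailed_noise(rng, alpha=1.5):
    # Symmetric Pareto-type noise (assumed model): moments of order < alpha are
    # finite, so with alpha = 1.5 the variance is infinite.
    magnitude = rng.paretovariate(alpha) - 1.0
    return magnitude if rng.random() < 0.5 else -magnitude


def best_grad_sq(rng, steps, tau=None, x0=3.0, step0=0.1):
    # Run SGD (clipped when tau is given) and return min_k |f'(x_k)|^2.
    x = x0
    best = grad(x) ** 2
    for t in range(1, steps + 1):
        g = grad(x) + heavy_tailed_noise(rng)
        if tau is not None:
            g = clip(g, tau)
        x -= step0 / math.sqrt(t) * g
        best = min(best, grad(x) ** 2)
    return best


def tail_probability(steps, eps, runs=200, tau=None, seed=0):
    # Fraction of independent runs whose best squared gradient norm exceeds eps.
    failures = 0
    for run in range(runs):
        rng = random.Random(seed + run)  # one reproducible noise stream per run
        if best_grad_sq(rng, steps, tau) > eps:
            failures += 1
    return failures / runs
```

Calling `tail_probability(t, eps, tau=1.0)` for increasing `t` at a fixed `eps` gives an empirical failure probability whose decay can be plotted on a log scale against the stated rates. Because each run reuses the same seed across horizons, the estimate is non-increasing in `t` by construction. A simulation of this size only illustrates the quantity being bounded and does not verify the asymptotic rates.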
Abstract
The study of tail behaviour of SGD-induced processes has been attracting a lot of interest, due to offering strong guarantees with respect to individual runs of an algorithm. While many works provide high-probability guarantees, quantifying the error rate for a fixed probability threshold, there is a lack of work directly studying the probability of failure, i.e., quantifying the tail decay rate for a fixed error threshold. Moreover, existing results are of finite-time nature, limiting their ability to capture the true long-term tail decay which is more informative for modern learning models, typically trained for millions of iterations. Our work closes these gaps, by studying the long-term tail decay of SGD-based methods through the lens of large deviations theory, establishing several strong results in the process. First, we provide an upper bound on the tails of the gradient…
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Complexity and Algorithms in Graphs · Sparse and Compressive Sensing Techniques
