Noisy Truncated SGD: Optimization and Generalization
Yingxue Zhou, Xinyan Li, Arindam Banerjee

TL;DR
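This paper introduces Noisy Truncated SGD (NT-SGD), a variant of stochastic gradient descent that sets most small gradient components to zero and adds Gaussian noise to the truncated gradient. NT-SGD matches the order of vanilla SGD's convergence rate on non-convex problems, provably escapes saddle points, and achieves better generalization bounds than Truncated SGD (T-SGD).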
Contribution
The paper provides a rigorous analysis of T-SGD and NT-SGD, covering the convergence rates and stability of both methods and the ability of NT-SGD to escape saddle points, with empirical validation on standard datasets.
Findings
NT-SGD matches vanilla SGD in speed and accuracy
NT-SGD effectively escapes saddle points due to added noise
NT-SGD achieves better generalization bounds than T-SGD
Abstract
Recent empirical work on stochastic gradient descent (SGD) applied to over-parameterized deep learning has shown that most gradient components over epochs are quite small. Inspired by such observations, we rigorously study properties of Truncated SGD (T-SGD), which truncates the majority of small gradient components to zero. Considering non-convex optimization problems, we show that the convergence rate of T-SGD matches the order of vanilla SGD. We also establish the generalization error bound for T-SGD. Further, we propose Noisy Truncated SGD (NT-SGD), which adds Gaussian noise to the truncated gradients. We prove that NT-SGD has the same convergence rate as T-SGD for non-convex optimization problems. We demonstrate that with the help of noise, NT-SGD can provably escape from saddle points and requires less noise compared to previous related work. We also prove that NT-SGD achieves…
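The update described in the abstract can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not give the exact truncation rule or noise placement, so the sketch assumes a top-k rule (keep the k largest-magnitude components) and adds Gaussian noise to every coordinate. The names truncate_top_k, nt_sgd_step, k, and sigma are hypothetical.

```python
import numpy as np


def truncate_top_k(grad, k):
    """Keep the k largest-magnitude components of grad and zero the rest.

    Assumption: a top-k rule stands in for "truncating the majority of
    small gradient components"; the paper's exact rule may differ.
    """
    flat = grad.ravel()
    if k >= flat.size:
        return grad.copy()
    out = np.zeros_like(flat)
    if k > 0:
        # Indices of the k entries with the largest absolute value.
        idx = np.argpartition(np.abs(flat), -k)[-k:]
        out[idx] = flat[idx]
    return out.reshape(grad.shape)


def nt_sgd_step(w, grad, lr, k, sigma, rng):
    """One noisy truncated step: truncate, add Gaussian noise, descend.

    Assumption: noise is added to every coordinate of the truncated
    gradient. Setting sigma = 0 gives a plain truncated (T-SGD style) step.
    """
    g = truncate_top_k(grad, k)
    noise = sigma * rng.standard_normal(g.shape)
    return w - lr * (g + noise)


# Toy usage on f(w) = 0.5 * ||w||^2, whose exact gradient is w itself.
rng = np.random.default_rng(0)
w = rng.standard_normal(10)
for _ in range(200):
    w = nt_sgd_step(w, grad=w, lr=0.1, k=3, sigma=0.01, rng=rng)
print("final norm:", np.linalg.norm(w))
```

In this toy problem only three coordinates are updated by the gradient signal at each step, while the noise term perturbs all ten, which is the mechanism the abstract credits for escaping saddle points.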
Taxonomy
Topics: Advanced Neural Network Applications · Stochastic Gradient Optimization Techniques · Domain Adaptation and Few-Shot Learning
Methods: Stochastic Gradient Descent
