Noisy Truncated SGD: Optimization and Generalization

Yingxue Zhou; Xinyan Li; Arindam Banerjee

arXiv:2103.00075·cs.LG·October 19, 2021

TL;DR

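This paper studies Truncated SGD (T-SGD), which sets most small gradient components to zero, and proposes Noisy Truncated SGD (NT-SGD), which adds Gaussian noise to the truncated gradients. On non-convex problems, T-SGD matches the order of vanilla SGD's convergence rate and NT-SGD keeps the same rate as T-SGD, while the added noise lets NT-SGD provably escape saddle points.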

Contribution

The paper gives a rigorous analysis of T-SGD and NT-SGD, covering their convergence rates on non-convex problems, their generalization error bounds, and NT-SGD's ability to escape saddle points, with empirical validation on standard datasets.

Findings

01 NT-SGD matches vanilla SGD in speed and accuracy

02 NT-SGD effectively escapes saddle points due to added noise

03 NT-SGD achieves better generalization bounds than T-SGD

Abstract

Recent empirical work on stochastic gradient descent (SGD) applied to over-parameterized deep learning has shown that most gradient components over epochs are quite small. Inspired by such observations, we rigorously study properties of Truncated SGD (T-SGD), that truncates the majority of small gradient components to zeros. Considering non-convex optimization problems, we show that the convergence rate of T-SGD matches the order of vanilla SGD. We also establish the generalization error bound for T-SGD. Further, we propose Noisy Truncated SGD (NT-SGD), which adds Gaussian noise to the truncated gradients. We prove that NT-SGD has the same convergence rate as T-SGD for non-convex optimization problems. We demonstrate that with the help of noise, NT-SGD can provably escape from saddle points and requires less noise compared to previous related work. We also prove that NT-SGD achieves…
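The two update rules described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the page does not state the exact truncation rule, so the sketch assumes a top-k rule (keep the k largest-magnitude components, zero the rest) and assumes the Gaussian noise is added to every coordinate of the truncated gradient. The names `truncate_top_k`, `nt_sgd_step`, and `run_on_saddle`, and the parameters `k`, `sigma`, and `lr`, are hypothetical.

```python
import random


def truncate_top_k(grad, k):
    """Keep the k largest-magnitude components of grad; zero the rest.

    Assumption: top-k is one possible truncation rule, chosen for illustration.
    """
    if k <= 0:
        return [0.0] * len(grad)
    # Indices sorted by decreasing |g_i|; ties keep the lower index first.
    order = sorted(range(len(grad)), key=lambda i: -abs(grad[i]))
    keep = set(order[:k])
    return [g if i in keep else 0.0 for i, g in enumerate(grad)]


def nt_sgd_step(params, grad, lr, k, sigma, rng):
    """One illustrative update: truncate, add Gaussian noise, take a step.

    With sigma == 0 this reduces to a plain truncated (T-SGD style) step.
    """
    truncated = truncate_top_k(grad, k)
    noisy = [g + rng.gauss(0.0, sigma) for g in truncated]
    return [p - lr * g for p, g in zip(params, noisy)]


def run_on_saddle(sigma, steps=200, seed=0):
    """Iterate on f(x, y) = 0.5 * (x^2 - y^2), starting at its saddle (0, 0)."""
    rng = random.Random(seed)
    point = [0.0, 0.0]
    for _ in range(steps):
        grad = [point[0], -point[1]]  # exact gradient of f at the current point
        point = nt_sgd_step(point, grad, lr=0.1, k=1, sigma=sigma, rng=rng)
    return point


stuck = run_on_saddle(sigma=0.0)     # no noise: the gradient is zero at the saddle
escaped = run_on_saddle(sigma=0.01)  # noise perturbs the iterate off the saddle
print("without noise:", stuck)
print("with noise:   ", escaped)
```

In this toy setting the noiseless run never leaves the origin because the gradient there is exactly zero, while the noisy run is pushed off it. This only illustrates the intuition behind the saddle-point claim; it says nothing about the rates or noise levels proved in the paper.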

Taxonomy

Topics: Advanced Neural Network Applications · Stochastic Gradient Optimization Techniques · Domain Adaptation and Few-Shot Learning

Methods: Stochastic Gradient Descent