Variance-reduced Clipping for Non-convex Optimization
Amirhossein Reisizadeh, Haochuan Li, Subhro Das, Ali Jadbabaie

TL;DR
This paper introduces a variance-reduced clipping method for non-convex optimization that improves theoretical complexity bounds and demonstrates competitive empirical performance in deep learning tasks.
Contribution
It develops a variance reduction technique combined with gradient clipping under relaxed smoothness assumptions, achieving order-optimal complexity bounds.
Findings
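Improves the stochastic gradient complexity of clipped SGD under relaxed smoothness from O(ε^{-4}) to O(ε^{-3}) using the SPIDER variance-reduction technique.
Achieves an order-optimal stochastic gradient complexity of O(√n ε^{-2} + n) for finite-sum problems with n components.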
Empirically outperforms or matches existing variance-reduced methods in vision tasks.
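To make the first two findings concrete, here is a minimal sketch of a SPIDER-style recursive gradient estimator combined with a clipped step size, run on a toy least-squares finite sum. It is an illustration under stated assumptions, not the authors' algorithm: the step rule min(eta, gamma / ||v||) is a generic clipping rule, and the names spider_clipped, grad_i, q, batch, eta, and gamma, along with all hyperparameter values, are hypothetical. The paper's actual learning rate is not given on this page.

```python
import numpy as np


def spider_clipped(grad_i, x0, n, q=10, batch=8, eta=0.1, gamma=0.5,
                   iters=200, seed=0):
    """SPIDER-style variance-reduced descent with a clipped step size.

    grad_i(x, idx) must return the average gradient of components idx at x.
    All hyperparameters are illustrative assumptions, not the paper's values.
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float).copy()
    x_prev = x.copy()
    v = np.zeros_like(x)
    for k in range(iters):
        if k % q == 0:
            # Periodic restart: full gradient over all n components.
            v = grad_i(x, np.arange(n))
        else:
            # Recursive update: the same mini-batch is evaluated at the
            # current and the previous iterate, and the difference is
            # added to the running estimate.
            idx = rng.integers(0, n, size=batch)
            v = grad_i(x, idx) - grad_i(x_prev, idx) + v
        # Clipped step size (assumed form): the update length
        # step * ||v|| never exceeds gamma.
        step = min(eta, gamma / (np.linalg.norm(v) + 1e-12))
        x_prev = x.copy()
        x = x - step * v
    return x


# Toy finite-sum problem: f(x) = (1/n) * sum_i 0.5 * (a_i . x - b_i)^2
data_rng = np.random.default_rng(1)
n, d = 50, 3
A = data_rng.standard_normal((n, d))
b = A @ np.ones(d)


def grad_i(x, idx):
    # Average gradient of the selected least-squares components.
    residual = A[idx] @ x - b[idx]
    return A[idx].T @ residual / len(idx)


def loss(x):
    return 0.5 * np.mean((A @ x - b) ** 2)


x0 = 10.0 * np.ones(d)
x_out = spider_clipped(grad_i, x0, n)
print(loss(x0), loss(x_out))
```

Each non-restart iteration costs two mini-batch gradient evaluations, and a full pass over the n components happens only once every q iterations. That trade-off is where the per-iteration savings over full-gradient descent come from.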
Abstract
Gradient clipping is a standard training technique used in deep learning applications such as large-scale language modeling to mitigate exploding gradients. Recent experimental studies have demonstrated a fairly special behavior in the smoothness of the training objective along its trajectory when trained with gradient clipping. That is, the smoothness grows with the gradient norm. This is in clear contrast to the well-established assumption in folklore non-convex optimization, a.k.a. L-smoothness, where the smoothness is assumed to be bounded by a constant globally. The recently introduced (L_0, L_1)-smoothness is a more relaxed notion that captures such behavior in non-convex optimization. In particular, it has been shown that under this relaxed smoothness assumption, SGD with clipping requires O(ε^{-4}) stochastic gradient computations to find an…
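For reference, the two smoothness notions contrasted in the abstract are commonly written as follows. This is the widely used second-order form for twice-differentiable f; the paper's exact statement of the assumption may differ, so treat it as a sketch rather than a quotation.

```latex
% L-smoothness: the gradient is globally Lipschitz with constant L
\|\nabla f(x) - \nabla f(y)\| \le L \,\|x - y\| \quad \text{for all } x, y

% (L_0, L_1)-smoothness: the local smoothness may grow linearly with the gradient norm
\|\nabla^2 f(x)\| \le L_0 + L_1 \,\|\nabla f(x)\| \quad \text{for all } x
```

Setting L_1 = 0 in the second condition recovers the usual bounded-Hessian case with L = L_0, which is why the relaxed notion is strictly more general.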
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Domain Adaptation and Few-Shot Learning · Machine Learning and ELM
Methods: Stochastic Gradient Descent
