Robust and Fast Training via Per-Sample Clipping
Davide Nobile, Philipp Grohs

TL;DR
This paper introduces a per-sample gradient clipping method for stochastic gradient descent that improves robustness and convergence in non-convex optimization, supported by theoretical analysis and empirical results.
Contribution
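The paper presents PS-Clip-SGD, an SGD variant built on a per-sample clipped gradient estimator. It comes with optimal in-expectation convergence rates for non-convex problems under heavy-tailed gradient noise, high-probability guarantees that match those rates up to polylogarithmic factors in the failure probability, and demonstrated practical advantages over vanilla SGD with momentum and standard gradient clipping.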
Findings
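PS-Clip-SGD achieves optimal in-expectation convergence rates for non-convex optimization under heavy-tailed gradient noise.
It outperforms both vanilla SGD with momentum and standard gradient clipping when training AlexNet on CIFAR-100, even after accounting for the extra compute time of per-sample clipping.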
Clipping at the mini-batch level can improve training with negligible extra cost.
Abstract
We propose a robust gradient estimator based on per-sample gradient clipping and analyze its properties both theoretically and empirically. We show that the resulting method, per-sample clipped SGD (PS-Clip-SGD), achieves optimal in-expectation convergence rates for non-convex optimization problems under heavy-tailed gradient noise. Moreover, we establish high-probability convergence guarantees that match the in-expectation rates up to polylogarithmic factors in the failure probability. We complement our theoretical results with multiple numerical experiments. In particular, we demonstrate that PS-Clip-SGD outperforms both vanilla SGD with momentum and standard gradient clipping when training AlexNet on the CIFAR-100 dataset, even after accounting for the additional computational time caused by per-sample clipping. We also empirically show that, in the presence of gradient accumulation,…
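To make the difference between per-sample clipping and standard (mini-batch) clipping concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the authors' implementation: the function names, the clipping threshold, and the toy gradients are all hypothetical, and the paper's step sizes, momentum, and threshold choices are not given in the text above.

```python
import math

def clip(vec, threshold):
    """Rescale vec so its Euclidean norm is at most threshold."""
    norm = math.sqrt(sum(x * x for x in vec))
    scale = 1.0 if norm <= threshold else threshold / norm
    return [scale * x for x in vec]

def mean(vectors):
    """Coordinate-wise average of equally sized vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def per_sample_clipped_gradient(per_sample_grads, threshold):
    """Per-sample clipping: clip each sample's gradient, then average."""
    return mean([clip(g, threshold) for g in per_sample_grads])

def batch_clipped_gradient(per_sample_grads, threshold):
    """Standard clipping: average first, then clip the mini-batch gradient."""
    return clip(mean(per_sample_grads), threshold)

# Hypothetical toy batch: three well-behaved samples and one
# heavy-tailed outlier (made-up numbers, not data from the paper).
grads = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0], [0.0, 100.0]]

ps = per_sample_clipped_gradient(grads, threshold=1.0)
bc = batch_clipped_gradient(grads, threshold=1.0)

print("per-sample clipped:", ps)
print("batch clipped:     ", bc)
```

In this toy batch, the per-sample estimator bounds the outlier's contribution before averaging, so the result still points mostly along the direction shared by the three typical samples. Clipping the averaged gradient instead only rescales a vector whose direction the outlier already dominates.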
