Robust and Fast Training via Per-Sample Clipping

Davide Nobile; Philipp Grohs

arXiv:2605.02701 · math.OC · May 5, 2026

TL;DR

This paper introduces a per-sample gradient clipping method for stochastic gradient descent that improves robustness and convergence in non-convex optimization, supported by theoretical analysis and empirical results.

Contribution

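The paper presents PS-Clip-SGD, an SGD variant built on a per-sample clipped gradient estimator, which attains optimal in-expectation convergence guarantees under heavy-tailed gradient noise, and demonstrates its practical advantages over vanilla SGD with momentum and standard gradient clipping.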

Findings

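01. PS-Clip-SGD achieves optimal in-expectation convergence rates for non-convex optimization under heavy-tailed gradient noise.

02. It outperforms both vanilla SGD with momentum and standard gradient clipping when training AlexNet on CIFAR-100, even after accounting for the extra computation time of per-sample clipping.

03. Clipping at the mini-batch level can improve training with negligible extra cost.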

Abstract

We propose a robust gradient estimator based on per-sample gradient clipping and analyze its properties both theoretically and empirically. We show that the resulting method, per-sample clipped SGD (PS-Clip-SGD), achieves optimal in-expectation convergence rates for non-convex optimization problems under heavy-tailed gradient noise. Moreover, we establish high-probability convergence guarantees that match the in-expectation rates up to polylogarithmic factors in the failure probability. We complement our theoretical results with multiple numerical experiments. In particular, we demonstrate that PS-Clip-SGD outperforms both vanilla SGD with momentum and standard gradient clipping when training AlexNet on the CIFAR-100 dataset, even after accounting for the additional computational time caused by per-sample clipping. We also empirically show that, in the presence of gradient accumulation,…
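To make the distinction between per-sample clipping and standard (mini-batch) clipping concrete, here is a minimal sketch in plain Python. It is not the paper's exact estimator: the function names, the threshold value, and the toy gradients are all hypothetical, and the only assumption taken from the abstract is that per-sample clipping clips each sample's gradient before averaging, whereas standard clipping clips the averaged gradient.

```python
import math


def l2_norm(vec):
    """Euclidean norm of a vector given as a list of floats."""
    return math.sqrt(sum(x * x for x in vec))


def clip(vec, threshold):
    """Rescale vec so that its L2 norm is at most threshold."""
    norm = l2_norm(vec)
    if norm <= threshold or norm == 0.0:
        return list(vec)
    scale = threshold / norm
    return [scale * x for x in vec]


def mean(vectors):
    """Coordinate-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def per_sample_clipped_mean(sample_grads, threshold):
    """Clip every per-sample gradient first, then average (assumed PS-Clip form)."""
    return mean([clip(g, threshold) for g in sample_grads])


def batch_clipped_mean(sample_grads, threshold):
    """Average first, then clip the mini-batch gradient (standard clipping)."""
    return clip(mean(sample_grads), threshold)


# Toy mini-batch (hypothetical): three well-behaved gradients and one
# heavy-tailed outlier pointing in a different direction.
grads = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0], [0.0, 100.0]]

ps = per_sample_clipped_mean(grads, threshold=1.0)  # approximately [0.75, 0.25]
bc = batch_clipped_mean(grads, threshold=1.0)       # approximately [0.03, 1.00]

print("per-sample clipping:", ps)
print("mini-batch clipping:", bc)
```

In this toy case the single outlier dominates the direction of the mini-batch-clipped gradient, while the per-sample-clipped average still points mostly along the direction shared by the other three samples. The extra cost in practice comes from having to compute each sample's gradient norm individually, which is the overhead the abstract refers to.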
