Second-order step-size tuning of SGD for non-convex optimization
Camille Castera, Jérôme Bolte, Cédric Févotte, Edouard Pauwels

TL;DR
This paper introduces Step-Tuned SGD, a stochastic first-order method that tunes the step-size of mini-batch SGD using curvature estimated from a local quadratic model and noisy gradient approximations. In experiments on deep residual networks it yields better results than SGD, RMSprop, or ADAM.
Contribution
It proposes a step-size tuning method for mini-batch SGD that estimates local curvature from noisy gradients only, and it proves almost sure convergence to the critical set together with convergence rates.
Findings
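Almost sure convergence to the critical set, with convergence rates.
Better results than SGD, RMSprop, or ADAM when training deep residual networks.
A sudden drop of the loss and an improvement of test accuracy at medium stages of training.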
Abstract
In view of a direct and simple improvement of vanilla SGD, this paper presents a fine-tuning of its step-sizes in the mini-batch case. For doing so, one estimates curvature, based on a local quadratic model and using only noisy gradient approximations. One obtains a new stochastic first-order method (Step-Tuned SGD), enhanced by second-order information, which can be seen as a stochastic version of the classical Barzilai-Borwein method. Our theoretical results ensure almost sure convergence to the critical set and we provide convergence rates. Experiments on deep residual network training illustrate the favorable properties of our approach. For such networks we observe, during training, both a sudden drop of the loss and an improvement of test accuracy at medium stages, yielding better results than SGD, RMSprop, or ADAM.
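The abstract describes Step-Tuned SGD as a stochastic version of the classical Barzilai-Borwein method. As a point of reference only, the following is a minimal sketch of the classical, deterministic Barzilai-Borwein step-size on a small convex quadratic. It is not the paper's algorithm: the function name `bb_gradient_descent`, the fallback step `alpha0`, the clipping bounds, and the test problem are all assumptions made for illustration.

```python
import numpy as np

def bb_gradient_descent(grad, x0, n_iters=300, alpha0=1e-3,
                        alpha_min=1e-8, alpha_max=1e2):
    """Gradient descent with the classical Barzilai-Borwein step-size.

    Illustrative sketch only (hypothetical helper, not the paper's method).
    The step is alpha = <s, s> / <s, y>, where s is the last iterate
    difference and y the last gradient difference. <s, y> / <s, s> acts as
    a curvature estimate along the direction s.
    """
    x_prev = np.asarray(x0, dtype=float)
    g_prev = grad(x_prev)
    x = x_prev - alpha0 * g_prev          # first step uses the default size
    for _ in range(n_iters):
        g = grad(x)
        s = x - x_prev                    # iterate difference
        y = g - g_prev                    # gradient difference
        curvature = s @ y
        if curvature > 0:
            alpha = (s @ s) / curvature   # inverse of the estimated curvature
        else:
            alpha = alpha0                # safeguard: non-positive curvature
        alpha = min(max(alpha, alpha_min), alpha_max)  # keep the step bounded
        x_prev, g_prev = x, g
        x = x - alpha * g
    return x

# Assumed toy problem: f(x) = 0.5 * x^T A x - b^T x, whose gradient is A x - b.
A = np.diag([1.0, 10.0, 100.0])
b = np.ones(3)
x_star = np.linalg.solve(A, b)
x_hat = bb_gradient_descent(lambda x: A @ x - b, np.zeros(3))
print(np.linalg.norm(x_hat - x_star))
```

The fallback to `alpha0` when the curvature estimate is not positive is one simple safeguard for non-convex objectives. In the mini-batch setting the gradients are noisy, which is the situation the paper addresses; this sketch uses exact gradients.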
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Gaussian Processes and Bayesian Inference · Sparse and Compressive Sensing Techniques
Methods: Stochastic Gradient Descent
