First-ish Order Methods: Hessian-aware Scalings of Gradient Descent
Oscar Smee, Fred Roosta, Stephen J. Wright

TL;DR
This paper introduces a Hessian-aware scaling method for gradient descent that adaptively adjusts step sizes based on curvature, improving convergence and reducing tuning in large-scale machine learning optimization.
Contribution
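The paper proposes a Hessian-aware scaling of the gradient direction that comes with a local unit step size guarantee, even in nonconvex settings. Near a local minimum satisfying the second-order sufficient conditions, the method converges linearly with a unit step size, and it converges globally under weaker smoothness assumptions than the standard ones.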
Findings
The method achieves linear convergence with a unit step size near local minima that satisfy the second-order sufficient conditions.
Global convergence is proven under weaker smoothness conditions.
Empirical validation shows improved performance on machine learning tasks.
Abstract
Gradient descent is the primary workhorse for optimizing large-scale problems in machine learning. However, its performance is highly sensitive to the choice of the learning rate. A key limitation of gradient descent is its lack of natural scaling, which often necessitates expensive line searches or heuristic tuning to determine an appropriate step size. In this paper, we address this limitation by incorporating Hessian information to scale the gradient direction. By accounting for the curvature of the function along the gradient, our adaptive, Hessian-aware scaling method ensures a local unit step size guarantee, even in nonconvex settings. Near a local minimum that satisfies the second-order sufficient conditions, our approach achieves linear convergence with a unit step size. We show that our method converges globally under a significantly weaker version of the standard Lipschitz…
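The idea of scaling the gradient by the curvature along it can be illustrated with a minimal sketch. It assumes one natural choice of scaling, the ratio g·g / g·Hg, which is not confirmed by the truncated abstract as the paper's exact rule. The names `hessian_scaled_gd`, `grad`, `hess_vec`, and `fallback_step` are hypothetical, and the fixed fallback used when the curvature along the gradient is not positive is a placeholder rather than the authors' treatment of nonconvexity.

```python
def dot(u, v):
    """Euclidean inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def hessian_scaled_gd(grad, hess_vec, x0, fallback_step=1e-2,
                      tol=1e-10, max_iter=1000):
    """Gradient descent whose step is scaled by curvature along the gradient.

    grad(x)        -> gradient at x (list of floats)
    hess_vec(x, v) -> Hessian-vector product H(x) v (list of floats)
    Assumption: the scaling g.g / g.Hg is used when g.Hg > 0; otherwise a
    small fixed step (a placeholder, not the paper's rule) is taken.
    """
    x = list(x0)
    for k in range(max_iter):
        g = grad(x)
        gg = dot(g, g)
        if gg ** 0.5 <= tol:          # gradient norm small enough: stop
            return x, k
        curvature = dot(g, hess_vec(x, g))
        if curvature > 1e-12 * gg:    # positive curvature along g
            scale = gg / curvature
        else:                         # nonpositive curvature: placeholder
            scale = fallback_step
        # Unit step along the scaled direction -scale * g.
        x = [xi - scale * gi for xi, gi in zip(x, g)]
    return x, max_iter


# Example: f(x) = 0.5 x^T A x - b^T x with a symmetric positive definite A.
A = [[3.0, 1.0], [1.0, 2.0]]
b = [1.0, 1.0]


def quad_grad(x):
    return [dot(row, x) - bi for row, bi in zip(A, b)]


def quad_hess_vec(x, v):
    return [dot(row, v) for row in A]


x_star, iters = hessian_scaled_gd(quad_grad, quad_hess_vec, [0.0, 0.0])
```

On this quadratic the scaled step coincides with steepest descent under exact line search, so no learning rate has to be tuned. The sketch only needs Hessian-vector products, never the full Hessian, which is what keeps this kind of scaling plausible at large scale.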
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Model Reduction and Neural Networks
