An Inertial Newton Algorithm for Deep Learning
Camille Castera, Jérôme Bolte, Cédric Févotte, Edouard Pauwels

TL;DR
This paper introduces INNA, a second-order inertial optimization algorithm tailored for deep learning, combining Newton-like behavior with stochastic approximations, and demonstrates its convergence and competitive performance on benchmark tasks.
Contribution
The paper presents INNA, a novel second-order inertial method for deep learning that leverages loss geometry and proves its convergence, addressing spurious stationary points and enabling aggressive learning rates.
Findings
INNA achieves competitive results on deep learning benchmarks.
Theoretical convergence of INNA is established for deep learning problems.
INNA's convergence analysis handles spurious stationary points through the $D$-criticality framework.
Abstract
We introduce a new second-order inertial optimization method for machine learning called INNA. It exploits the geometry of the loss function while only requiring stochastic approximations of the function values and the generalized gradients. This makes INNA fully implementable and adapted to large-scale optimization problems such as the training of deep neural networks. The algorithm combines both gradient-descent and Newton-like behaviors as well as inertia. We prove the convergence of INNA for most deep learning problems. To do so, we provide a well-suited framework to analyze deep learning loss functions involving tame optimization in which we study a continuous dynamical system together with its discrete stochastic approximations. We prove sublinear convergence for the continuous-time differential inclusion which underlies our algorithm. Additionally, we also show how standard…
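The combination of gradient-descent, Newton-like, and inertial behavior described in the abstract can be illustrated with a small sketch. The update rule below is recalled from the INNA paper, not stated on this page, so treat it as an assumption. The same holds for the hyperparameter values (`alpha`, `beta`, `step`), the constant step size (the paper's analysis uses vanishing steps), the initialization of `psi`, and the toy quadratic loss. Only (noisy) gradients are used and no Hessian is ever formed; the auxiliary variable `psi` carries the second-order and inertial information.

```python
import random


def inna_minimize(grad, theta0, alpha=0.5, beta=0.1, step=0.1,
                  n_steps=500, noise_std=0.0, seed=0):
    """Sketch of an INNA-style iteration (update form is an assumption).

    theta <- theta + step * ((1/beta - alpha) * theta - psi / beta - beta * g)
    psi   <- psi   + step * ((1/beta - alpha) * theta - psi / beta)
    where g is a (possibly noisy) gradient estimate at theta.
    """
    rng = random.Random(seed)
    theta = list(theta0)
    # Assumed initialization: makes the initial velocity equal to -beta * gradient.
    psi = [(1.0 - alpha * beta) * t for t in theta]
    c = 1.0 / beta - alpha
    for _ in range(n_steps):
        # Stochastic gradient: exact gradient plus Gaussian noise (hypothetical noise model).
        g = [gi + rng.gauss(0.0, noise_std) for gi in grad(theta)]
        # Term shared by both updates, computed from the current (theta, psi).
        drift = [c * t - p / beta for t, p in zip(theta, psi)]
        theta = [t + step * (d - beta * gi) for t, d, gi in zip(theta, drift, g)]
        psi = [p + step * d for p, d in zip(psi, drift)]
    return theta


# Toy problem (illustrative only): J(theta) = 0.5 * sum(a_i * theta_i ** 2).
CURVATURES = [1.0, 10.0]


def loss(theta):
    return 0.5 * sum(a * t * t for a, t in zip(CURVATURES, theta))


def grad(theta):
    return [a * t for a, t in zip(CURVATURES, theta)]


if __name__ == "__main__":
    start = [1.0, -2.0]
    end = inna_minimize(grad, start, noise_std=0.01)
    print("initial loss:", loss(start))
    print("final loss:  ", loss(end))
```

In this sketch `alpha` plays the role of viscous damping and `beta` weights the Newton-like term; setting `noise_std=0.0` gives a deterministic run that is easier to inspect.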
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Markov Chains and Monte Carlo Methods · Sparse and Compressive Sensing Techniques
