Adaptive Momentum and Nonlinear Damping for Neural Network Training
Aikaterini Karoni, Rajit Rajpal, Benedict Leimkuhler, Gabriel Stoltz

TL;DR
This paper introduces a per-parameter adaptive momentum scheme with nonlinear (cubic) damping for neural network training. The scheme adjusts to local landscape curvature to maintain stability without sacrificing convergence speed, and is evaluated on training ViT, BERT, and GPT2.
Contribution
The paper presents a continuous-time adaptive momentum method with cubic damping, supported by theoretical convergence results and by experiments in which it matches or outperforms Adam.
Findings
The proposed methods train ViT, BERT, and GPT2 robustly, on tasks where mSGD typically struggles.
They match or outperform Adam in the reported experiments.
Theoretical results establish exponential convergence of the proposed schemes.
Abstract
We propose a continuous-time scheme for large-scale optimization that introduces individual, adaptive momentum coefficients regulated by the kinetic energy of each model parameter. This approach automatically adjusts to local landscape curvature to maintain stability without sacrificing convergence speed. We demonstrate that our adaptive friction can be related to cubic damping, a suppression mechanism from structural dynamics. Furthermore, we introduce two specific optimization schemes by augmenting the continuous dynamics of mSGD and Adam with a cubic damping term. Empirically, our methods demonstrate robustness and match or outperform Adam on training ViT, BERT, and GPT2 tasks where mSGD typically struggles. We further provide theoretical results establishing the exponential convergence of the proposed schemes.
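As a rough illustration of the idea described in the abstract, the sketch below augments heavy-ball (mSGD-style) dynamics with an elementwise cubic damping term and integrates them on a toy quadratic. It is an assumption-laden sketch, not the authors' scheme: the form of the dynamics (dq/dt = p, dp/dt = -∇f(q) - γp - c p³), the semi-implicit Euler discretization, the function names, and the values of `h`, `gamma`, and `c` are all chosen here for illustration. Under this form the cubic term can be read as a per-coordinate friction γ + c p_i², which grows with the kinetic energy of each parameter.

```python
import numpy as np

def cubic_damped_momentum_step(q, p, grad_fn, h=0.01, gamma=1.0, c=0.1):
    """One semi-implicit Euler step of the (assumed) dynamics
        dq/dt = p
        dp/dt = -grad f(q) - gamma * p - c * p**3   (elementwise)
    The names and the discretization are illustrative, not from the paper.
    """
    g = grad_fn(q)
    # Linear friction plus cubic damping act on each coordinate separately.
    p = p + h * (-g - gamma * p - c * p**3)
    # Position update uses the freshly updated momentum.
    q = q + h * p
    return q, p

# Toy objective f(q) = 0.5 * sum(curvatures * q**2) with mixed curvature.
curvatures = np.array([1.0, 50.0])

def quadratic_grad(q):
    return curvatures * q

q = np.array([1.0, 1.0])
p = np.zeros(2)
for _ in range(5000):
    q, p = cubic_damped_momentum_step(q, p, quadratic_grad)

final_loss = 0.5 * np.sum(curvatures * q**2)
print(final_loss)
```

In this toy setting the stiff coordinate (curvature 50) picks up larger momentum early on, so it experiences stronger effective friction than the flat coordinate. Whether this matches the behavior of the paper's schemes on real models is not something this sketch can show.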
Taxonomy
Topics: Model Reduction and Neural Networks · Stochastic Gradient Optimization Techniques · Gaussian Processes and Bayesian Inference
