TL;DR
This paper introduces a mathematically grounded principle for optimizer selection based on gradient alignment, leading to dynamic momentum rules that improve training efficiency across various tasks.
Contribution
It formulates optimizer selection as an inner product maximization problem and derives simple dynamic momentum rules with theoretical stability guarantees.
Findings
Dynamic momentum rules match or outperform fixed hyperparameters.
The approach reduces the need for manual hyperparameter tuning.
Experiments demonstrate improved training speed across multiple domains.
Abstract
Recent works have shown that gradient-update alignment is a powerful signal for modulating optimizer updates, often leading to faster training. We promote this update-wise heuristic to a mathematically grounded principle for selecting and tuning optimizer hyperparameters. By treating gradients and updates as signals and an optimizer as a causal filter that maps between them, we formulate optimizer selection as maximizing the expected drop rate in loss over a prescribed family of optimizers. We show that this objective is exactly the inner product between the optimizer filter and the gradient autocorrelation, and prove that a greedy optimum exists and has a stability bound under perturbations of the estimated gradient statistics. Specializing to momentum-based optimizers, the theory yields simple dynamic momentum selection rules for both SGD+Momentum and Adam/AdamW. Experiments across…
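The inner-product claim in the abstract can be checked numerically in a toy setting. The sketch below is not the paper's code or its selection rule. It assumes a first-order reading in which the loss drop for an update u_t is proportional to <g_t, u_t>, and it models heavy-ball momentum as a truncated causal filter with taps h[k] = beta**k, so that u_t = sum_k h[k] g_{t-k}. The function names, the synthetic gradients, and the choices beta = 0.9 and 20 taps are all illustrative assumptions. Under these assumptions, the average alignment computed directly should equal the inner product of h with the empirical gradient autocorrelation R[k].

```python
import numpy as np

def momentum_filter(beta, num_taps):
    """Truncated impulse response of heavy-ball momentum: h[k] = beta**k."""
    return beta ** np.arange(num_taps)

def autocorrelation(grads, num_taps):
    """Empirical R[k] = average over t of <g_t, g_{t-k}>, for k = 0..num_taps-1."""
    T = grads.shape[0]
    start = num_taps - 1  # first step that has a full history window
    return np.array([
        np.mean(np.sum(grads[start:] * grads[start - k:T - k], axis=1))
        for k in range(num_taps)
    ])

def mean_alignment(grads, h):
    """Directly average <g_t, u_t>, with u_t the causally filtered gradient."""
    T = grads.shape[0]
    start = len(h) - 1
    total = 0.0
    for t in range(start, T):
        u_t = sum(h[k] * grads[t - k] for k in range(len(h)))
        total += float(grads[t] @ u_t)
    return total / (T - start)

# Synthetic gradients (an assumption for illustration): constant drift plus noise.
rng = np.random.default_rng(0)
grads = 0.5 + rng.normal(size=(200, 8))

h = momentum_filter(beta=0.9, num_taps=20)
R = autocorrelation(grads, num_taps=20)

direct = mean_alignment(grads, h)
via_inner_product = float(h @ R)
print(direct, via_inner_product)  # the two values agree up to floating-point error
```

The agreement follows from linearity alone, so it holds for any filter taps, not only the momentum family. Turning this score into a rule for choosing beta requires a normalization or constraint on the filter family, which the abstract does not specify and which this sketch does not attempt.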
