Refresh-Scaling the Memory of Balanced Adam
Alberto Fernández-Hernández, Cristian Pérez-Corral, Jose I. Mestre, Manuel F. Dolz, Enrique S. Quintana-Ortí

TL;DR
This paper proposes a new perspective on Adam's momentum parameter, viewing it as a memory horizon that can be tuned for improved robustness across vision and language tasks.
Contribution
It introduces a refresh rule based on the memory horizon, improving Adam's robustness by adaptively setting the momentum parameter according to training scale.
Findings
Choosing the refresh count R_β≈1000 improves robustness.
The refresh rule reduces the maximum validation loss gap by 33.4%.
All experiments achieve within 1% of their validation oracle.
Abstract
Recent evidence suggests that Adam performs robustly when its momentum parameters are tied, β1 = β2 = β, reducing the optimizer to a single remaining parameter. However, how this parameter should be set remains poorly understood. We argue that, in balanced Adam, β should not be treated as a dimensionless constant: it defines a statistical memory horizon. In terms of the effective learning horizon, estimated from the validation trajectory, we study the refresh count R_β, which measures how many times Adam renews its internal statistics during the useful phase of training. Across 11 vision and language experiments, we find that choosing β so that R_β ≈ 1000 selects different values of β depending on the training scale, yet improves robustness over the best fixed-beta baseline. Compared…
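To make the relationship between the refresh count and the momentum parameter concrete, here is a minimal sketch. It rests on two assumptions that this summary does not state: that the memory horizon is the usual exponential-moving-average timescale 1/(1 - β), and that the refresh count is the effective learning horizon divided by that memory horizon. The function names and the step counts in the usage lines are hypothetical, and the paper's exact definitions may differ.

```python
def refresh_count(effective_horizon_steps: float, beta: float) -> float:
    """Refresh count under the ASSUMED definition R = T_eff * (1 - beta).

    Assumption (not from the source): memory horizon = 1 / (1 - beta).
    """
    if not 0.0 <= beta < 1.0:
        raise ValueError("beta must lie in [0, 1)")
    if effective_horizon_steps <= 0:
        raise ValueError("effective_horizon_steps must be positive")
    return effective_horizon_steps * (1.0 - beta)


def beta_from_refresh_count(effective_horizon_steps: float,
                            target_refresh_count: float = 1000.0) -> float:
    """Invert the assumed relation to get the tied beta (beta1 = beta2).

    The default target of 1000 follows the R_beta ~ 1000 finding; the
    inversion itself relies on the assumed EMA-timescale definition.
    """
    if effective_horizon_steps <= 0 or target_refresh_count <= 0:
        raise ValueError("both arguments must be positive")
    memory_horizon = effective_horizon_steps / target_refresh_count
    if memory_horizon < 1.0:
        # A memory shorter than one step would require a negative beta.
        raise ValueError("horizon too short for the requested refresh count")
    return 1.0 - 1.0 / memory_horizon


# Hypothetical training scales: a longer run selects a larger beta.
short_run_beta = beta_from_refresh_count(100_000)    # about 0.99
long_run_beta = beta_from_refresh_count(1_000_000)   # about 0.999
```

Under these assumptions, the same target refresh count yields a different β at each training scale, which is the behaviour the abstract describes. The resulting value would be passed as both momentum coefficients of an Adam implementation, for example `betas=(beta, beta)` in `torch.optim.Adam`.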
