Promoting Exploration in Memory-Augmented Adam using Critical Momenta
Pranshu Malviya, Gonçalo Mordido, Aristide Baratin, Reza Babanezhad Harikandeh, Jerry Huang, Simon Lacoste-Julien, Razvan Pascanu, Sarath Chandar

TL;DR
This paper introduces a memory-augmented Adam optimizer that promotes exploration of flatter minima, leading to improved generalization in deep learning models across various tasks.
Contribution
A novel memory-augmented Adam optimizer that incorporates critical momentum buffers to encourage exploration of flatter minima, enhancing model generalization.
Findings
Improves exploration towards flatter minima in simple settings.
Enhances model performance on ImageNet, CIFAR, Penn Treebank, and online learning tasks.
Abstract
Adaptive gradient-based optimizers, notably Adam, have left their mark in training large-scale deep learning models, offering fast convergence and robustness to hyperparameter settings. However, they often struggle with generalization, attributed to their tendency to converge to sharp minima in the loss landscape. To address this, we propose a new memory-augmented version of Adam that encourages exploration towards flatter minima by incorporating a buffer of critical momentum terms during training. This buffer prompts the optimizer to overshoot beyond narrow minima, promoting exploration. Through comprehensive analysis in simple settings, we illustrate the efficacy of our approach in increasing exploration and bias towards flatter minima. We empirically demonstrate that it can improve model performance for image classification on ImageNet and CIFAR10/100, language modelling on Penn…
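The idea in the abstract, an Adam update augmented with a small buffer of momenta, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The class name `CriticalMomentaAdamSketch`, the `capacity` and `decay` parameters, the use of the gradient norm as buffer priority (suggested by the reviews on this page), summation as the aggregation rule, and the choice to keep the second moment on raw gradients are all assumptions made for illustration.

```python
import numpy as np


class CriticalMomentaAdamSketch:
    """Illustrative Adam variant with a buffer of stored momenta.

    Assumptions (not confirmed by the source): priority = gradient norm,
    priorities decay each step, buffered momenta are summed into the
    first moment, and the second moment uses raw gradients as in Adam.
    """

    def __init__(self, lr=1e-2, beta1=0.9, beta2=0.999, eps=1e-8,
                 capacity=5, decay=0.7):
        self.lr, self.beta1, self.beta2, self.eps = lr, beta1, beta2, eps
        self.capacity, self.decay = capacity, decay
        self.buffer = []  # entries are [priority, momentum]
        self.m = None
        self.v = None
        self.t = 0

    def step(self, params, grad):
        if self.m is None:
            self.m = np.zeros_like(params)
            self.v = np.zeros_like(params)
        self.t += 1

        # Standard Adam moment estimates.
        self.m = self.beta1 * self.m + (1 - self.beta1) * grad
        self.v = self.beta2 * self.v + (1 - self.beta2) * grad ** 2

        # Aggregate the current momentum with the buffered ones (sum).
        buffered = sum((mom for _, mom in self.buffer),
                       np.zeros_like(self.m))
        aggregated = self.m + buffered

        # Decay old priorities so stale momenta are eventually replaced.
        for entry in self.buffer:
            entry[0] *= self.decay

        # Store the current momentum if the buffer has room, or if its
        # gradient norm beats the lowest priority currently stored.
        priority = float(np.linalg.norm(grad))
        if len(self.buffer) < self.capacity:
            self.buffer.append([priority, self.m.copy()])
        else:
            weakest = min(range(len(self.buffer)),
                          key=lambda i: self.buffer[i][0])
            if priority > self.buffer[weakest][0]:
                self.buffer[weakest] = [priority, self.m.copy()]

        # Bias-corrected update using the aggregated first moment.
        m_hat = aggregated / (1 - self.beta1 ** self.t)
        v_hat = self.v / (1 - self.beta2 ** self.t)
        return params - self.lr * m_hat / (np.sqrt(v_hat) + self.eps)
```

With an empty buffer the first step reduces to plain Adam. Once the buffer fills, the summed momenta enlarge the step along directions that recently had large gradients, which is one way to picture the overshooting behaviour the abstract describes.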
Peer Reviews
Decision · Submitted to ICLR 2024
The paper is well-written and clear, and uses both toy examples and real-world problems to demonstrate the effectiveness of the proposed method.
1. Prioritizing momenta with large gradient norms is not well motivated. It is unclear if and how this choice affects the performance of the method.
2. The paper seems to associate flat minima with global minima or lower loss values in some of the toy examples, which is not always the case in practice. Indeed, the training loss of a flat minimum is usually higher than that of a sharp minimum on real-world datasets.
3. The proposed method is partly motivated by the performance gap between SGD and
- The proposed method is well-motivated, and to the best of my knowledge, it presents a novel approach to optimization.
- The presentation is lucid and easy to follow.
- The paper offers comprehensive evidence to back the primary claims, including theoretical analysis, numerical simulation results with toy examples, and real-world practical experiments.
- (minor) The paper would be more convincing if larger-scale empirical experiments on NLP were conducted.
The preliminary experiments using artificial loss functions such as the Goldstein-Price loss function and Ackley loss function show a clear advantage of CM variants to explore and find the lower loss surface near the global solution.
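For reference, the two test functions named in this review can be written down directly. The sketch below uses their standard two-dimensional textbook definitions. Whether the paper uses exactly these forms, domains, or scalings is not stated on this page, so treat that as an assumption.

```python
import numpy as np


def ackley(x, y):
    """Standard 2-D Ackley function; global minimum 0 at (0, 0)."""
    term1 = -20.0 * np.exp(-0.2 * np.sqrt(0.5 * (x ** 2 + y ** 2)))
    term2 = -np.exp(0.5 * (np.cos(2 * np.pi * x) + np.cos(2 * np.pi * y)))
    return term1 + term2 + np.e + 20.0


def goldstein_price(x, y):
    """Standard Goldstein-Price function; global minimum 3 at (0, -1)."""
    a = 1 + (x + y + 1) ** 2 * (
        19 - 14 * x + 3 * x ** 2 - 14 * y + 6 * x * y + 3 * y ** 2
    )
    b = 30 + (2 * x - 3 * y) ** 2 * (
        18 - 32 * x + 12 * x ** 2 + 48 * y - 36 * x * y + 27 * y ** 2
    )
    return a * b
```

Both surfaces have many local minima around a single global one, which is why they are common choices for checking whether an optimizer keeps exploring instead of settling into the first basin it reaches.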
Although the experiments using artificial loss functions are promising, the results on actual deep neural networks are not as good. The fact that both CM methods overfit after convergence for LSTM on PTB in Figure 6 raises a major concern, and contradicts the main claim of the paper that CM methods can find flatter minima. The scores reported in this paper seem to be considerably lower than those reported on "Papers with Code". For example, EfficientNet-B0 on ImageNet should achieve a top-
Taxonomy
Topics: Machine Learning and Data Classification · Neural Networks and Applications · Advanced Neural Network Applications
Methods: Adam
