Promoting Exploration in Memory-Augmented Adam using Critical Momenta

Pranshu Malviya; Gonçalo Mordido; Aristide Baratin; Reza Babanezhad Harikandeh; Jerry Huang; Simon Lacoste-Julien; Razvan Pascanu; Sarath Chandar

arXiv:2307.09638·cs.LG·June 19, 2024


TL;DR

This paper introduces a memory-augmented Adam optimizer that promotes exploration of flatter minima, leading to improved generalization in deep learning models across various tasks.

Contribution

A novel memory-augmented Adam optimizer that incorporates critical momentum buffers to encourage exploration of flatter minima, enhancing model generalization.

Findings

01. Improves exploration towards flatter minima in simple settings.
02. Enhances model performance on ImageNet, CIFAR, Penn Treebank, and online learning tasks.

Abstract

Adaptive gradient-based optimizers, notably Adam, have left their mark in training large-scale deep learning models, offering fast convergence and robustness to hyperparameter settings. However, they often struggle with generalization, attributed to their tendency to converge to sharp minima in the loss landscape. To address this, we propose a new memory-augmented version of Adam that encourages exploration towards flatter minima by incorporating a buffer of critical momentum terms during training. This buffer prompts the optimizer to overshoot beyond narrow minima, promoting exploration. Through comprehensive analysis in simple settings, we illustrate the efficacy of our approach in increasing exploration and bias towards flatter minima. We empirically demonstrate that it can improve model performance for image classification on ImageNet and CIFAR10/100, language modelling on Penn…
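The buffer mechanism described in the abstract can be illustrated with a minimal one-dimensional sketch. This is not the authors' implementation: the function name `cm_adam_sketch`, the parameters `capacity` and `decay`, the use of the gradient magnitude as the buffer priority, and the choice to sum the buffered momenta with the current one are all assumptions made for illustration. The second-moment estimate is left as in standard Adam, which may differ from the paper.

```python
import math

def cm_adam_sketch(grad_fn, x0, steps=200, lr=0.1, beta1=0.9, beta2=0.999,
                   eps=1e-8, capacity=5, decay=0.7):
    """Toy 1-D Adam variant with a buffer of stored momenta (assumed design)."""
    x, m, v = float(x0), 0.0, 0.0
    buffer = []  # entries are [priority, momentum]; hypothetical structure
    for t in range(1, steps + 1):
        g = grad_fn(x)
        m = beta1 * m + (1.0 - beta1) * g       # standard first moment
        v = beta2 * v + (1.0 - beta2) * g * g   # standard second moment

        # Age stored priorities so that stale momenta can be replaced.
        for entry in buffer:
            entry[0] *= decay

        # Assumption: priority is the gradient magnitude at this step.
        priority = abs(g)
        if capacity > 0:
            if len(buffer) < capacity:
                buffer.append([priority, m])
            else:
                weakest = min(range(capacity), key=lambda i: buffer[i][0])
                if priority > buffer[weakest][0]:
                    buffer[weakest] = [priority, m]

        # Assumption: aggregate by summing current and buffered momenta.
        # The extra terms are what can push the iterate past a narrow basin.
        m_agg = m + sum(entry[1] for entry in buffer)

        m_hat = m_agg / (1.0 - beta1 ** t)      # bias correction
        v_hat = v / (1.0 - beta2 ** t)
        x -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return x, buffer
```

With `capacity=0` the buffer stays empty and the loop reduces to plain Adam, which gives a convenient baseline for comparing how far the buffered variant travels on a toy loss.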

Peer Reviews

Decision · Submitted to ICLR 2024

Reviewer 01 · Rating 5 (marginally below the acceptance threshold) · Confidence 4

Strengths

The paper is well-written and clear, and uses both toy examples and real-world problems to demonstrate the effectiveness of the proposed method.

Weaknesses

1. Prioritizing momenta with large gradient norms is not well motivated. It is unclear if and how this choice affects the performance of the method.
2. The paper seems to associate flat minima with global minima or lower loss values in some of the toy examples, which is not always the case in practice. Indeed, the training loss of a flat minimum is usually higher than that of a sharp minimum on real-world datasets.
3. The proposed method is partly motivated by the performance gap between SGD and

Reviewer 02 · Rating 6 (marginally above the acceptance threshold) · Confidence 1

Strengths

- The proposed method is well-motivated, and to the best of my knowledge, it presents a novel approach to optimization.
- The presentation is lucid and easy to follow.
- The paper offers comprehensive evidence to back the primary claims, including theoretical analysis, numerical simulation results with toy examples, and real-world practical experiments.

Weaknesses

- (minor) The paper would be more convincing if larger-scale empirical experiments on NLP were conducted.

Reviewer 03 · Rating 5 (marginally below the acceptance threshold) · Confidence 4

Strengths

The preliminary experiments on artificial loss functions, such as the Goldstein-Price and Ackley functions, show a clear advantage of the CM variants in exploring and finding lower-loss regions near the global solution.
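For reference, the two benchmark functions named in this review can be written in their standard textbook forms, sketched below. Whether the paper uses exactly these forms or rescaled variants is an assumption here, and the function names are chosen for illustration.

```python
import math

def ackley(x, y):
    """Standard 2-D Ackley function; global minimum 0 at (0, 0)."""
    term1 = -20.0 * math.exp(-0.2 * math.sqrt(0.5 * (x * x + y * y)))
    term2 = -math.exp(0.5 * (math.cos(2 * math.pi * x) + math.cos(2 * math.pi * y)))
    return term1 + term2 + math.e + 20.0

def goldstein_price(x, y):
    """Standard Goldstein-Price function; global minimum 3 at (0, -1)."""
    a = 1 + (x + y + 1) ** 2 * (
        19 - 14 * x + 3 * x ** 2 - 14 * y + 6 * x * y + 3 * y ** 2)
    b = 30 + (2 * x - 3 * y) ** 2 * (
        18 - 32 * x + 12 * x ** 2 + 48 * y - 36 * x * y + 27 * y ** 2)
    return a * b
```

Both surfaces have many local minima surrounding a single global one, which is why they are common test beds for how far an optimizer explores before settling.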

Weaknesses

Although the experiments using artificial loss functions are promising, the results on actual deep neural networks are not as good. The fact that both CM methods overfit after convergence for the LSTM on PTB in Figure 6 raises a major concern and contradicts the paper's main claim that CM methods can find flatter minima. The scores reported in this paper also seem considerably lower than those reported on "papers with code". For example, EfficientNet-B0 on ImageNet should achieve a top-

Code & Models

Repositories

chandar-lab/cmoptimizer (PyTorch, official)


Taxonomy

Topics: Machine Learning and Data Classification · Neural Networks and Applications · Advanced Neural Network Applications

Methods: Adam