Improving Generalization Performance by Switching from Adam to SGD

Nitish Shirish Keskar; Richard Socher

arXiv:1712.07628 · cs.LG · December 21, 2017 · 403 cites

TL;DR

This paper introduces SWATS, a hybrid optimization strategy that switches from Adam to SGD during training, significantly improving generalization performance across various benchmarks with minimal overhead.

Contribution

The paper proposes SWATS, a simple, low-overhead method that starts training with Adam and switches to SGD once a projection-based trigger fires, with the aim of improving generalization in neural network training.

Findings

01. SWATS closes the generalization gap between Adam and SGD on multiple benchmarks.

02. The switching strategy improves test accuracy and generalization performance.

03. Minimal additional computational overhead is introduced by the method.

Abstract

Despite superior training outcomes, adaptive optimization methods such as Adam, Adagrad or RMSprop have been found to generalize poorly compared to stochastic gradient descent (SGD). These methods tend to perform well in the initial portion of training but are outperformed by SGD at later stages of training. We investigate a hybrid strategy that begins training with an adaptive method and switches to SGD when appropriate. Concretely, we propose SWATS, a simple strategy which switches from Adam to SGD when a triggering condition is satisfied. The condition we propose relates to the projection of Adam steps on the gradient subspace. By design, the monitoring process for this condition adds very little overhead and does not increase the number of hyperparameters in the optimizer. We report experiments on several standard benchmarks such as ResNet, SENet, DenseNet and PyramidNet for the…
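The abstract names the trigger but does not give its formula. The following is a minimal sketch of one way such a monitor could look, assuming the form in which the condition is usually stated: for an Adam step p and gradient g, estimate the SGD step size as gamma = (p·p) / (-p·g), keep a bias-corrected exponential average of gamma using Adam's beta2, and switch once the average and the current estimate agree to within a small tolerance. The class name `SwitchMonitor`, the `tol` parameter, and its default value are illustrative assumptions, not taken from the text above.

```python
class SwitchMonitor:
    """Hypothetical helper: decides when to hand over from Adam to SGD."""

    def __init__(self, beta2=0.999, tol=1e-9):
        self.beta2 = beta2  # reuses Adam's second-moment decay (assumption)
        self.tol = tol      # agreement threshold for switching (assumption)
        self.lam = 0.0      # exponential average of the step-size estimate
        self.k = 0          # iteration counter, used for bias correction

    def update(self, p, g):
        """Take the Adam step p and gradient g as lists of floats.

        Returns the estimated SGD learning rate once the trigger fires,
        otherwise None.
        """
        self.k += 1
        pg = sum(pi * gi for pi, gi in zip(p, g))
        if pg == 0.0:
            # Step is orthogonal to the gradient: no estimate this iteration.
            return None
        # Scaling of -g whose projection onto p equals p itself.
        gamma = sum(pi * pi for pi in p) / -pg
        self.lam = self.beta2 * self.lam + (1.0 - self.beta2) * gamma
        corrected = self.lam / (1.0 - self.beta2 ** self.k)
        if self.k > 1 and abs(corrected - gamma) < self.tol:
            return corrected
        return None


# Usage: call once per Adam iteration; a non-None result is the SGD rate.
monitor = SwitchMonitor()
for _ in range(2):
    sgd_lr = monitor.update([-1.0, 0.0], [2.0, 0.0])
```

In this toy usage the step and gradient are identical on both calls, so the estimate is constant and the monitor fires on the second call. In real training the estimate fluctuates, and the switch happens only once it stabilizes. The monitor reuses quantities Adam already computes and adds only two dot products per iteration, which is consistent with the low-overhead claim.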

Taxonomy

Topics: Neural Networks and Applications · Image Retrieval and Classification Techniques · Advanced Image and Video Retrieval Techniques

Methods: Sigmoid Activation · Average Pooling · Concatenated Skip Connection · Squeeze-and-Excitation Block · Dense Block · Zero-padded Shortcut Connection · Pyramidal Residual Unit · Pyramidal Bottleneck Residual Unit · Dropout · SENet