Improving Generalization Performance by Switching from Adam to SGD
Nitish Shirish Keskar, Richard Socher

TL;DR
This paper introduces SWATS, a hybrid optimization strategy that switches from Adam to SGD during training, significantly improving generalization performance across various benchmarks with minimal overhead.
Contribution
The paper proposes SWATS, a simple, low-overhead method that switches from Adam to SGD once a trigger derived from the projection of Adam steps onto the gradient subspace is satisfied, with the goal of improving generalization in neural network training.
Findings
SWATS closes the generalization gap between Adam and SGD on multiple benchmarks.
The switching strategy improves test accuracy relative to training with Adam alone.
Monitoring the switching condition adds very little computational overhead and does not increase the number of optimizer hyperparameters.
Abstract
Despite superior training outcomes, adaptive optimization methods such as Adam, Adagrad or RMSprop have been found to generalize poorly compared to stochastic gradient descent (SGD). These methods tend to perform well in the initial portion of training but are outperformed by SGD at later stages of training. We investigate a hybrid strategy that begins training with an adaptive method and switches to SGD when appropriate. Concretely, we propose SWATS, a simple strategy which switches from Adam to SGD when a triggering condition is satisfied. The condition we propose relates to the projection of Adam steps on the gradient subspace. By design, the monitoring process for this condition adds very little overhead and does not increase the number of hyperparameters in the optimizer. We report experiments on several standard benchmarks such as ResNet, SENet, DenseNet and PyramidNet for the…
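The abstract only says that the trigger "relates to the projection of Adam steps on the gradient subspace", so the following is a hedged sketch of one possible reading, not the authors' implementation. It assumes that for each Adam step `p` and gradient `g` we look for the scalar `gamma` such that projecting the SGD step `-gamma * g` onto `p` gives back `p`, that these estimates are smoothed with a bias-corrected moving average, and that the switch fires when the smoothed value and the current estimate agree to within a tolerance. The names `projected_sgd_lr` and `SwitchMonitor`, the decay `beta`, and the tolerance `tol` are all illustrative assumptions.

```python
def projected_sgd_lr(step, grad):
    """Scalar gamma such that projecting -gamma * grad onto `step` equals `step`.

    Returns None when the step is orthogonal to the gradient (assumed convention).
    """
    dot_pg = sum(p * g for p, g in zip(step, grad))
    if dot_pg == 0.0:
        return None
    dot_pp = sum(p * p for p in step)
    return dot_pp / (-dot_pg)


class SwitchMonitor:
    """Tracks a moving average of gamma and signals when it has stabilized."""

    def __init__(self, beta=0.999, tol=1e-9):  # assumed values, not from the abstract
        self.beta = beta
        self.tol = tol
        self.avg = 0.0
        self.k = 0

    def update(self, step, grad):
        """Feed one Adam step; return an SGD learning rate if the trigger fires, else None."""
        self.k += 1
        gamma = projected_sgd_lr(step, grad)
        if gamma is None:
            return None
        # Exponential moving average with the usual bias correction.
        self.avg = self.beta * self.avg + (1.0 - self.beta) * gamma
        corrected = self.avg / (1.0 - self.beta ** self.k)
        if self.k > 1 and abs(corrected - gamma) < self.tol:
            return corrected
        return None


if __name__ == "__main__":
    monitor = SwitchMonitor()
    grad = [1.0, 2.0]
    step = [-0.1 * g for g in grad]  # a step that already looks like SGD with lr 0.1
    print(monitor.update(step, grad))  # first call never triggers
    print(monitor.update(step, grad))  # estimate is stable, so a learning rate is returned
```

In a training loop, the value returned by `update` would be used as the learning rate for the SGD phase. Only scalar dot products are computed per step, which is consistent with the abstract's claim that monitoring adds very little overhead.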
Code & Models
- 🤗 H-D-T/Buzz-8b-Large-v0.5 (model) · 16 dl · ♡ 29
- 🤗 LoneStriker/Buzz-8b-Large-v0.5-GGUF (model) · 14 dl · ♡ 1
- 🤗 LoneStriker/Buzz-8b-Large-v0.5-3.0bpw-h6-exl2 (model) · 4 dl
- 🤗 LoneStriker/Buzz-8b-Large-v0.5-4.0bpw-h6-exl2 (model) · 2 dl
- 🤗 LoneStriker/Buzz-8b-Large-v0.5-5.0bpw-h6-exl2 (model) · 3 dl
- 🤗 LoneStriker/Buzz-8b-Large-v0.5-6.0bpw-h6-exl2 (model) · 3 dl
- 🤗 LoneStriker/Buzz-8b-Large-v0.5-8.0bpw-h8-exl2 (model) · 3 dl
- 🤗 QuantFactory/Buzz-8b-Large-v0.5-GGUF (model) · 73 dl
- 🤗 afrideva/Buzz-8b-Large-v0.5-GGUF (model) · 26 dl
Taxonomy
Topics: Neural Networks and Applications · Image Retrieval and Classification Techniques · Advanced Image and Video Retrieval Techniques
Methods: Sigmoid Activation · Average Pooling · Concatenated Skip Connection · Squeeze-and-Excitation Block · Dense Block · Zero-padded Shortcut Connection · Pyramidal Residual Unit · Pyramidal Bottleneck Residual Unit · Dropout · SENet
