Averaging Weights Leads to Wider Optima and Better Generalization
Pavel Izmailov, Dmitrii Podoprikhin, Timur Garipov, Dmitry Vetrov, Andrew Gordon Wilson

TL;DR
This paper introduces Stochastic Weight Averaging (SWA), a simple method that averages multiple points along SGD trajectories with cyclical or constant learning rates, leading to flatter solutions and improved generalization in deep neural networks.
Contribution
The paper proposes SWA, a novel averaging technique that enhances generalization and finds flatter minima, outperforming traditional SGD in deep neural network training.
Findings
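SWA achieves higher test accuracy than conventional SGD training on residual networks, PyramidNets, DenseNets, and Shake-Shake networks, evaluated on CIFAR-10, CIFAR-100, and ImageNet.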
SWA finds flatter minima associated with better generalization.
SWA is easy to implement with minimal computational overhead.
Abstract
Deep neural networks are typically trained by optimizing a loss function with an SGD variant, in conjunction with a decaying learning rate, until convergence. We show that simple averaging of multiple points along the trajectory of SGD, with a cyclical or constant learning rate, leads to better generalization than conventional training. We also show that this Stochastic Weight Averaging (SWA) procedure finds much flatter solutions than SGD, and approximates the recent Fast Geometric Ensembling (FGE) approach with a single model. Using SWA we achieve notable improvement in test accuracy over conventional SGD training on a range of state-of-the-art residual networks, PyramidNets, DenseNets, and Shake-Shake networks on CIFAR-10, CIFAR-100, and ImageNet. In short, SWA is extremely easy to implement, improves generalization, and has almost no computational overhead.
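The averaging step described in the abstract can be sketched on a toy problem. The following is a minimal illustration, not the authors' implementation: it runs SGD with a constant learning rate on a noisy quadratic and keeps an equal-weight running average of periodic weight snapshots. The function names (`swa_update`, `sgd_with_swa`, `noisy_quadratic_grad`) and all hyperparameters (`lr`, `swa_start`, `swa_every`) are assumptions chosen for illustration.

```python
import random


def swa_update(w_swa, w, n_models):
    """Fold snapshot w into the running mean w_swa of n_models earlier snapshots."""
    return [(a * n_models + b) / (n_models + 1) for a, b in zip(w_swa, w)]


def noisy_quadratic_grad(w, rng):
    """Gradient of 0.5 * ||w||^2 plus Gaussian noise (a toy stand-in for minibatch noise)."""
    return [wi + rng.gauss(0.0, 1.0) for wi in w]


def sgd_with_swa(grad_fn, w0, lr=0.1, steps=200, swa_start=100, swa_every=10, seed=0):
    """Constant-learning-rate SGD that averages a snapshot every swa_every steps.

    Returns the final SGD iterate, the averaged weights, and the snapshot count.
    All default values are illustrative assumptions, not values from the paper.
    """
    rng = random.Random(seed)
    w = list(w0)
    w_swa, n_models = list(w0), 0  # with n_models == 0 the first update copies w
    for step in range(1, steps + 1):
        g = grad_fn(w, rng)
        w = [wi - lr * gi for wi, gi in zip(w, g)]
        if step >= swa_start and (step - swa_start) % swa_every == 0:
            w_swa = swa_update(w_swa, w, n_models)
            n_models += 1
    return w, w_swa, n_models


if __name__ == "__main__":
    w_sgd, w_avg, n = sgd_with_swa(noisy_quadratic_grad, [5.0, -3.0])
    print("last SGD iterate:", w_sgd)
    print("averaged weights:", w_avg, "from", n, "snapshots")
```

The only extra state is one additional copy of the weights and a counter, which is consistent with the abstract's claim of almost no computational overhead. In this sketch the averaged weights, rather than the last SGD iterate, would be the ones used for prediction.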
Taxonomy
Topics: Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning · Adversarial Robustness in Machine Learning
Methods: Concatenated Skip Connection · Dense Block · Average Pooling · Zero-padded Shortcut Connection · Pyramidal Residual Unit · Pyramidal Bottleneck Residual Unit · Dropout · Dense Connections · Cosine Annealing
