Averaging Weights Leads to Wider Optima and Better Generalization

Pavel Izmailov; Dmitrii Podoprikhin; Timur Garipov; Dmitry Vetrov; Andrew Gordon Wilson

arXiv:1803.05407 · cs.LG · February 26, 2019 · 227 cites

TL;DR

This paper introduces Stochastic Weight Averaging (SWA), a simple method that averages multiple points along SGD trajectories with cyclical or constant learning rates, leading to flatter solutions and improved generalization in deep neural networks.

Contribution

The paper proposes SWA, which averages the weights from multiple points along a single SGD trajectory. The averaged solution lies in a flatter region of the loss surface and generalizes better than the solution found by conventional SGD training.

Findings

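01. SWA achieves higher test accuracy than conventional SGD training on residual networks, PyramidNets, DenseNets, and Shake-Shake networks across CIFAR-10, CIFAR-100, and ImageNet.

02. SWA finds much flatter solutions than SGD, which the paper associates with better generalization.

03. SWA is easy to implement and adds almost no computational overhead.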

Abstract

Deep neural networks are typically trained by optimizing a loss function with an SGD variant, in conjunction with a decaying learning rate, until convergence. We show that simple averaging of multiple points along the trajectory of SGD, with a cyclical or constant learning rate, leads to better generalization than conventional training. We also show that this Stochastic Weight Averaging (SWA) procedure finds much flatter solutions than SGD, and approximates the recent Fast Geometric Ensembling (FGE) approach with a single model. Using SWA we achieve notable improvement in test accuracy over conventional SGD training on a range of state-of-the-art residual networks, PyramidNets, DenseNets, and Shake-Shake networks on CIFAR-10, CIFAR-100, and ImageNet. In short, SWA is extremely easy to implement, improves generalization, and has almost no computational overhead.
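The averaging step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`swa_update`, `sgd_with_swa`, `noisy_quadratic_grad`), the toy quadratic objective, and the parameters `swa_start` and `swa_every` are all assumptions made for illustration. It runs plain SGD with a constant learning rate and keeps an equal-weight running average of the weights sampled along the trajectory.

```python
import random


def swa_update(w_swa, w, n_models):
    """Fold weights `w` into a running average of `n_models` earlier snapshots."""
    return [(a * n_models + b) / (n_models + 1) for a, b in zip(w_swa, w)]


def sgd_with_swa(grad_fn, w0, lr=0.1, steps=200, swa_start=100, swa_every=10, seed=0):
    """Constant-learning-rate SGD that also averages periodic weight snapshots.

    `swa_start` and `swa_every` are hypothetical knobs for this sketch:
    averaging begins at step `swa_start` and takes a snapshot every
    `swa_every` steps after that.
    """
    rng = random.Random(seed)
    w = list(w0)
    w_swa, n_models = None, 0
    for step in range(1, steps + 1):
        g = grad_fn(w, rng)
        w = [wi - lr * gi for wi, gi in zip(w, g)]  # ordinary SGD step
        if step >= swa_start and (step - swa_start) % swa_every == 0:
            if w_swa is None:
                w_swa, n_models = list(w), 1  # first snapshot starts the average
            else:
                w_swa = swa_update(w_swa, w, n_models)
                n_models += 1
    return w, w_swa, n_models


def noisy_quadratic_grad(w, rng, noise=1.0):
    """Toy stand-in for a minibatch gradient: grad of 0.5*||w||^2 plus Gaussian noise."""
    return [wi + rng.gauss(0.0, noise) for wi in w]


if __name__ == "__main__":
    w_sgd, w_swa, n = sgd_with_swa(noisy_quadratic_grad, [5.0, -3.0])
    print("last SGD iterate:", w_sgd)
    print("SWA average of", n, "snapshots:", w_swa)
```

On this toy problem the SGD iterate keeps bouncing around the optimum because of the gradient noise, while the averaged weights tend to sit closer to it. The sketch covers only the weight-averaging step with a constant learning rate; a cyclical schedule would change how `lr` varies per step and when snapshots are taken.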

Taxonomy

Topics: Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning · Adversarial Robustness in Machine Learning

Methods: Concatenated Skip Connection · Dense Block · Average Pooling · Zero-padded Shortcut Connection · Pyramidal Residual Unit · Pyramidal Bottleneck Residual Unit · Dropout · Dense Connections · Cosine Annealing