Stochastic Weight Averaging Revisited
Hao Guo, Jiyong Jin, Bin Liu

TL;DR
This paper revisits stochastic weight averaging (SWA), analyzing its effects on neural network optimization, and introduces a new algorithm, PSWA, that leverages global geometric structures to improve model performance.
Contribution
The paper provides a detailed analysis of SWA's contributions, disentangles the effects of weight averaging and learning rate schedules, and proposes PSWA to better exploit loss landscape structures.
Findings
SWA can help discover wider optima, but it does not do so in every case.
Weight averaging reduces variance in model weights.
PSWA outperforms standard SWA and SGD.
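The variance-reduction finding can be illustrated with a short worked equation. This is an idealized sketch, not the paper's derivation: the symbols $w_i$, $\bar{w}_n$, $\mu$, and $\Sigma$ are introduced here for illustration, and the weight samples are assumed to be independent and identically distributed, which real SGD iterates are not.

```latex
% Assumption: w_1, ..., w_n are i.i.d. weight samples with mean \mu and covariance \Sigma.
\bar{w}_n = \frac{1}{n} \sum_{i=1}^{n} w_i,
\qquad
\mathbb{E}[\bar{w}_n] = \mu,
\qquad
\operatorname{Cov}(\bar{w}_n)
  = \frac{1}{n^2} \sum_{i=1}^{n} \operatorname{Cov}(w_i)
  = \frac{1}{n}\,\Sigma .
```

Because consecutive SGD iterates are positively correlated in practice, the reduction obtained from averaging $n$ of them is generally smaller than the $1/n$ factor above.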
Abstract
Averaging neural network weights sampled by a backbone stochastic gradient descent (SGD) procedure is a simple yet effective way to help the backbone SGD find better optima in terms of generalization. From a statistical perspective, weight averaging (WA) contributes to variance reduction. Recently, a well-established stochastic weight averaging (SWA) method was proposed, characterized by the use of a cyclical or high constant (CHC) learning rate schedule (LRS) to generate the weight samples for WA. This gave rise to a new insight on WA: that it helps discover wider optima, which in turn leads to better generalization. We conduct extensive experimental studies of SWA, involving a dozen modern DNN model structures and a dozen open-source benchmark image, graph, and text datasets. We disentangle the contributions of the WA operation and the CHC LRS to SWA, showing that the…
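The weight-averaging operation described in the abstract can be sketched in a few lines of NumPy. This is a minimal toy illustration, not the authors' implementation: the quadratic loss, the Gaussian gradient noise, the function names (`running_average`, `noisy_sgd_quadratic`, `average_tail`), and all hyperparameter values are assumptions chosen for the example. A high constant learning rate stands in for the CHC schedule.

```python
import numpy as np

def running_average(avg, new_sample, n_averaged):
    """Fold one more sample into an equal-weight mean of n_averaged samples."""
    return avg + (new_sample - avg) / (n_averaged + 1)

def noisy_sgd_quadratic(w0, lr, steps, noise_std, rng):
    """Toy SGD on f(w) = 0.5 * ||w||^2 with Gaussian gradient noise.

    Hypothetical setup for illustration; the optimum is w = 0.
    """
    w = np.array(w0, dtype=float)
    iterates = []
    for _ in range(steps):
        grad = w + noise_std * rng.standard_normal(w.shape)  # noisy gradient
        w = w - lr * grad                                     # constant learning rate
        iterates.append(w.copy())
    return iterates

def average_tail(iterates, start):
    """Average the weight samples collected from index `start` onward."""
    avg = np.zeros_like(iterates[0])
    for n, w in enumerate(iterates[start:]):
        avg = running_average(avg, w, n)
    return avg

rng = np.random.default_rng(0)
iterates = noisy_sgd_quadratic(w0=[5.0, -3.0], lr=0.1, steps=2000,
                               noise_std=1.0, rng=rng)
w_last = iterates[-1]                       # final SGD iterate
w_avg = average_tail(iterates, start=1000)  # averaged weights

print("distance of last iterate to optimum:", np.linalg.norm(w_last))
print("distance of averaged weights to optimum:", np.linalg.norm(w_avg))
```

In this toy setting the individual iterates keep fluctuating around the optimum because the learning rate is never decayed, while the average of many iterates lies much closer to it. That is the variance-reduction effect in its simplest form; it says nothing about the width of optima, which the toy quadratic cannot model.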
Taxonomy
Topics: Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning · Brain Tumor Detection and Classification
Methods: Stochastic Weight Averaging · Stochastic Gradient Descent
