Iterative Averaging in the Quest for Best Test Error
Diego Granziol, Xingchen Wan, Samuel Albanie, Stephen Roberts

TL;DR
This paper analyzes how iterate averaging improves generalization in high-dimensional models, deriving theoretical insights and proposing adaptive algorithms that outperform standard SGD on multiple datasets.
Contribution
The paper provides a theoretical explanation for iterate averaging's benefits and introduces two adaptive algorithms that enhance performance and reduce tuning.
Findings
Combining iterate averaging with large learning rates and explicit regularization (such as weight decay) yields a stronger overall regularization effect.
Less frequent averaging is justified and effective.
Adaptive gradient methods are expected to work equally well, or better, with iterate averaging than their non-adaptive counterparts.
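To make the findings above concrete, here is a minimal sketch of tail iterate averaging on a noisy quadratic, written in plain Python. It is an illustration under stated assumptions, not the paper's proposed adaptive algorithms: the function names, the parameters (`avg_start`, `avg_every`), and the i.i.d. Gaussian gradient noise are all hypothetical choices made for this example, and the noise model is much simpler than the Gaussian process perturbation model the abstract describes.

```python
import random


def sgd_with_iterate_averaging(curvatures, lr=0.1, weight_decay=0.0,
                               noise_std=1.0, steps=2000, avg_start=1000,
                               avg_every=10, seed=0):
    """Noisy gradient descent on f(w) = 0.5 * sum_i h_i * w_i**2.

    After `avg_start` steps, every `avg_every`-th iterate is folded into
    a running mean (less frequent averaging). All names are illustrative.
    """
    rng = random.Random(seed)
    dim = len(curvatures)
    w = [1.0] * dim          # current iterate
    w_avg = [0.0] * dim      # running mean of the sampled iterates
    n_avg = 0                # number of iterates averaged so far
    for t in range(1, steps + 1):
        for i, h in enumerate(curvatures):
            # True gradient plus Gaussian noise standing in for batch noise,
            # plus an L2 (weight decay) term.
            grad = h * w[i] + rng.gauss(0.0, noise_std) + weight_decay * w[i]
            w[i] -= lr * grad
        if t > avg_start and (t - avg_start) % avg_every == 0:
            n_avg += 1
            for i in range(dim):
                # Incremental mean update: avg += (x - avg) / n
                w_avg[i] += (w[i] - w_avg[i]) / n_avg
    return w, w_avg, n_avg


def true_risk(w, curvatures):
    """Noise-free quadratic loss at w."""
    return 0.5 * sum(h * x * x for h, x in zip(curvatures, w))


if __name__ == "__main__":
    hs = [1.0] * 50
    w_last, w_mean, n = sgd_with_iterate_averaging(hs)
    print("averaged iterates:", n)
    print("risk of last iterate:", true_risk(w_last, hs))
    print("risk of averaged iterate:", true_risk(w_mean, hs))
```

In this toy setting the last iterate keeps bouncing around the minimum at a scale set by the learning rate, while the averaged iterate sits much closer to it. Sampling only every `avg_every` steps reduces the correlation between the averaged iterates, which is one intuition for why less frequent averaging can still work well.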
Abstract
We analyse and explain the increased generalisation performance of iterate averaging using a Gaussian process perturbation model between the true and batch risk surface on the high dimensional quadratic. We derive three phenomena from our theoretical results: (1) The importance of combining iterate averaging (IA) with large learning rates and regularisation for improved regularisation. (2) Justification for less frequent averaging. (3) That we expect adaptive gradient methods to work equally well, or better, with iterate averaging than their non-adaptive counterparts. Inspired by these results, together with empirical investigations of the importance of appropriate regularisation for the solution diversity of the iterates, we propose two adaptive algorithms with iterate averaging. These give significantly better results compared to stochastic gradient descent…
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications · Sparse and Compressive Sensing Techniques
Methods: Weight Decay · Adam · Stochastic Gradient Descent
