When, Where and Why to Average Weights?
Niccolò Ajroldi, Antonio Orvieto, Jonas Geiping

TL;DR
This paper extensively evaluates checkpoint averaging in deep learning, demonstrating its ability to accelerate training, improve generalization, and complement learning rate decay across multiple architectures and datasets.
Contribution
It provides a comprehensive benchmark of averaging techniques, analyzing their effects on training speed, generalization, and their interaction with learning rate schedules.
Findings
Averaging significantly accelerates training.
Averaging yields mild generalization improvements.
Combining averaging with learning rate decay enhances performance.
Abstract
Averaging checkpoints along the training trajectory is a simple yet powerful approach to improve the generalization performance of Machine Learning models and reduce training time. Motivated by these potential gains, and in an effort to fairly and thoroughly benchmark this technique, we present an extensive evaluation of averaging techniques in modern Deep Learning, which we perform using AlgoPerf (Dahl et al., 2023), a large-scale benchmark for optimization algorithms. We investigate whether weight averaging can reduce training time, improve generalization, and replace learning rate decay, as suggested by recent literature. Our evaluation across seven architectures and datasets reveals that averaging significantly accelerates training and yields considerable efficiency gains, at the price of a minimal implementation and memory cost, while mildly improving generalization…
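To make "averaging checkpoints along the training trajectory" concrete, here is a minimal sketch of two common forms of weight averaging: a uniform running average of snapshots and an exponential moving average (EMA). The summary above does not state which variants the paper benchmarks or how they are configured, so the names `RunningAverage` and `ema_update`, the plain-list parameter format, and the `decay` value are all illustrative assumptions, not the authors' implementation.

```python
from typing import Dict, List

# Assumption: parameters are stored as a dict of flat float lists.
# Real frameworks would use tensors; this keeps the sketch dependency-free.
Params = Dict[str, List[float]]


class RunningAverage:
    """Uniform average of all parameter snapshots seen so far (hypothetical helper)."""

    def __init__(self) -> None:
        self.count = 0
        self.avg: Params = {}

    def update(self, params: Params) -> None:
        """Fold one new checkpoint into the average using an incremental mean."""
        self.count += 1
        if self.count == 1:
            # First snapshot: copy it so later training steps do not mutate the average.
            self.avg = {name: list(values) for name, values in params.items()}
            return
        for name, values in params.items():
            self.avg[name] = [
                a + (p - a) / self.count for a, p in zip(self.avg[name], values)
            ]


def ema_update(avg: Params, params: Params, decay: float) -> Params:
    """Return a new EMA of the weights: decay * avg + (1 - decay) * params."""
    return {
        name: [decay * a + (1.0 - decay) * p for a, p in zip(avg[name], params[name])]
        for name in params
    }


if __name__ == "__main__":
    # Three made-up checkpoints of a single two-element parameter "w".
    checkpoints = [{"w": [1.0, 2.0]}, {"w": [3.0, 4.0]}, {"w": [5.0, 6.0]}]
    uniform = RunningAverage()
    for ckpt in checkpoints:
        uniform.update(ckpt)
    print(uniform.avg)
```

Either average costs one extra copy of the weights, which is consistent with the "minimal implementation and memory cost" mentioned in the abstract. The averaged weights are used for evaluation only; training continues from the unaveraged weights.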
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Advanced Neural Network Applications · Machine Learning and Data Classification
