When, Where and Why to Average Weights?

Niccolò Ajroldi; Antonio Orvieto; Jonas Geiping

arXiv:2502.06761·cs.LG·November 25, 2025

TL;DR

This paper extensively evaluates checkpoint averaging in deep learning, demonstrating its ability to accelerate training, improve generalization, and complement learning rate decay across multiple architectures and datasets.

Contribution

The paper provides a comprehensive benchmark of averaging techniques, analyzing their effects on training speed and generalization, as well as their interaction with learning rate schedules.

Findings

01. Averaging significantly accelerates training.

02. Averaging yields mild generalization improvements.

03. Combining averaging with learning rate decay enhances performance.
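
To make the third finding concrete, here is a minimal sketch of how an averaged copy of the weights can be maintained alongside a decaying learning rate in one training loop. Everything in it is an illustrative assumption rather than the paper's setup: the toy objective f(x) = 0.5 * x**2, the cosine schedule, the exponential moving average, and the names `cosine_lr` and `train` are all hypothetical.

```python
import math
import random


def cosine_lr(step: int, total_steps: int, base_lr: float) -> float:
    """Cosine decay from base_lr at step 0 toward 0 at total_steps (assumed schedule)."""
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * step / total_steps))


def train(total_steps: int = 200, base_lr: float = 0.1,
          decay: float = 0.99, seed: int = 0):
    """Noisy SGD on a toy quadratic, returning (last iterate, averaged iterate)."""
    rng = random.Random(seed)
    x = 5.0      # single scalar "weight" of a toy model
    x_avg = x    # averaged copy, stored next to the live weight
    for step in range(total_steps):
        grad = x + rng.gauss(0.0, 1.0)  # noisy gradient of f(x) = 0.5 * x**2
        x -= cosine_lr(step, total_steps, base_lr) * grad
        # Exponential moving average, updated after every optimizer step.
        x_avg = decay * x_avg + (1.0 - decay) * x
    return x, x_avg


last_iterate, averaged_iterate = train()
```

The averaged copy never feeds back into the optimizer; it is only read out for evaluation, which is why the extra cost is one additional copy of the weights.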

Abstract

Averaging checkpoints along the training trajectory is a simple yet powerful approach to improve the generalization performance of Machine Learning models and reduce training time. Motivated by these potential gains, and in an effort to fairly and thoroughly benchmark this technique, we present an extensive evaluation of averaging techniques in modern Deep Learning, which we perform using AlgoPerf (Dahl et al., 2023), a large-scale benchmark for optimization algorithms. We investigate whether weight averaging can reduce training time, improve generalization, and replace learning rate decay, as suggested by recent literature. Our evaluation across seven architectures and datasets reveals that averaging significantly accelerates training and yields considerable efficiency gains, at the price of a minimal implementation and memory cost, while mildly improving generalization…
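
As a rough illustration of what "averaging checkpoints along the training trajectory" can look like in code, the sketch below folds checkpoints into a running uniform mean and into an exponential moving average. These are two common variants chosen here as assumptions; the abstract does not say which schemes the paper evaluates. The `Weights` container and both function names are hypothetical.

```python
from typing import Dict, List

# Hypothetical container: parameter name -> flat list of values.
Weights = Dict[str, List[float]]


def uniform_average(avg: Weights, new: Weights, n_averaged: int) -> Weights:
    """Fold one checkpoint into a running mean over n_averaged earlier checkpoints."""
    return {
        name: [a + (w - a) / (n_averaged + 1) for a, w in zip(avg[name], new[name])]
        for name in avg
    }


def ema_average(avg: Weights, new: Weights, decay: float = 0.9) -> Weights:
    """Exponential moving average: older checkpoints are down-weighted by `decay`."""
    return {
        name: [decay * a + (1.0 - decay) * w for a, w in zip(avg[name], new[name])]
        for name in avg
    }


# Three made-up checkpoints of a two-parameter model.
checkpoints = [{"w": [1.0, 2.0]}, {"w": [3.0, 4.0]}, {"w": [5.0, 6.0]}]

uniform = checkpoints[0]
for n, ckpt in enumerate(checkpoints[1:], start=1):
    uniform = uniform_average(uniform, ckpt, n)

ema = checkpoints[0]
for ckpt in checkpoints[1:]:
    ema = ema_average(ema, ckpt, decay=0.5)

print(uniform)  # {'w': [3.0, 4.0]}
print(ema)      # {'w': [3.5, 4.5]}
```

Both variants keep only one extra copy of the weights, which is consistent with the abstract's remark about minimal implementation and memory cost.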

Videos

When, Where and Why to Average Weights? · SlidesLive

Taxonomy

Topics: Stochastic Gradient Optimization Techniques · Advanced Neural Network Applications · Machine Learning and Data Classification