How to Scale Your EMA
Dan Busbridge, Jason Ramapuram, Pierre Ablin, Tatiana Likhomanenko, Eeshan Gunesh Dhekane, Xavier Suau, Russ Webb

TL;DR
This paper introduces a scaling rule for the model Exponential Moving Average (EMA) that keeps training dynamics consistent across batch sizes, making large-batch self-supervised learning more efficient.
Contribution
The authors propose a novel scaling rule for EMA optimization that maintains training dynamics across batch sizes and validate it across various architectures and modalities.
Findings
Validates the scaling rule across multiple architectures and data types.
Enables training of BYOL with batch sizes up to 24,576 without performance loss.
Achieves a 6× reduction in wall-clock time for SSL training under idealized hardware settings.
Abstract
Preserving training dynamics across batch sizes is an important tool for practical machine learning as it enables the trade-off between batch size and wall-clock time. This trade-off is typically enabled by a scaling rule; for example, in stochastic gradient descent, one should scale the learning rate linearly with the batch size. Another important machine learning tool is the model EMA, a functional copy of a target model, whose parameters move towards those of its target model according to an Exponential Moving Average (EMA) at a rate parameterized by a momentum hyperparameter. This model EMA can improve the robustness and generalization of supervised learning, stabilize pseudo-labeling, and provide a learning signal for Self-Supervised Learning (SSL). Prior works have not considered the optimization of the model EMA when performing scaling, leading to different training dynamics…
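The model EMA update and the hyperparameter scaling described in the abstract can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names `ema_update` and `scale_hyperparameters` are hypothetical, and the momentum rule used here (raise the momentum to the power of the batch-size factor, written `kappa`) is our reading of the paper's EMA Scaling Rule, which the truncated abstract above does not state explicitly. The check at the end uses a frozen target model, the one case where the equivalence is exact.

```python
def ema_update(ema_params, target_params, momentum):
    """One EMA step: each EMA parameter moves toward its target parameter."""
    return [momentum * e + (1.0 - momentum) * t
            for e, t in zip(ema_params, target_params)]


def scale_hyperparameters(lr, momentum, kappa):
    """Rescale hyperparameters when the batch size is multiplied by kappa.

    lr:       linear scaling rule for SGD, as stated in the abstract.
    momentum: exponentiated (momentum ** kappa); assumed form of the
              EMA Scaling Rule, not quoted from the excerpt above.
    """
    return lr * kappa, momentum ** kappa


# Illustration with a frozen target: kappa small-batch EMA steps land on the
# same point as one large-batch EMA step that uses the scaled momentum.
kappa = 4
ema_start = [0.0, 1.0]
target = [1.0, -1.0]

small_batch = ema_start
for _ in range(kappa):
    small_batch = ema_update(small_batch, target, momentum=0.99)

scaled_lr, scaled_momentum = scale_hyperparameters(0.1, 0.99, kappa)
large_batch = ema_update(ema_start, target, scaled_momentum)
```

With a target that changes between steps, the two trajectories agree only approximately, which is why the rule is paired with the learning-rate scaling rather than used on its own.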
Taxonomy
Topics: Neural Networks and Applications · Anomaly Detection Techniques and Applications · Machine Learning and Data Classification
Methods: Bootstrap Your Own Latent
