How to Scale Your EMA

Dan Busbridge; Jason Ramapuram; Pierre Ablin; Tatiana Likhomanenko; Eeshan Gunesh Dhekane; Xavier Suau; Russ Webb

arXiv:2307.13813 · stat.ML · November 8, 2023

TL;DR

This paper introduces a scaling rule for the momentum of a model Exponential Moving Average (EMA). The rule keeps training dynamics consistent across batch sizes, which makes large-batch self-supervised learning more efficient.

Contribution

The authors propose a scaling rule for the model EMA that preserves training dynamics across batch sizes, and they validate it across a range of architectures and modalities.

Findings

01 Validates the scaling rule across multiple architectures and data types.

02 Enables training of BYOL with batch sizes up to 24,576 without performance loss.

03 Achieves a 6× reduction in wall-clock time for BYOL training under idealized hardware settings.

Abstract

Preserving training dynamics across batch sizes is an important tool for practical machine learning as it enables the trade-off between batch size and wall-clock time. This trade-off is typically enabled by a scaling rule; for example, in stochastic gradient descent, one should scale the learning rate linearly with the batch size. Another important machine learning tool is the model EMA, a functional copy of a target model, whose parameters move towards those of its target model according to an Exponential Moving Average (EMA) at a rate parameterized by a momentum hyperparameter. This model EMA can improve the robustness and generalization of supervised learning, stabilize pseudo-labeling, and provide a learning signal for Self-Supervised Learning (SSL). Prior works have not considered the optimization of the model EMA when performing scaling, leading to different training dynamics…
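
To make the two ingredients of the abstract concrete, here is a minimal Python sketch of a model EMA update and of hyperparameter scaling by a batch-size factor κ. It assumes the convention that the momentum ρ weights the old EMA value, and it uses the rule as the paper states it: when the batch size grows by κ, the SGD learning rate is multiplied by κ and the EMA momentum becomes ρ^κ. The function names and the numeric values are hypothetical and chosen only for illustration; this is not the authors' code.

```python
def scale_hyperparameters(base_lr, base_momentum, base_batch_size, new_batch_size):
    """Scale the SGD learning rate and the EMA momentum for a new batch size.

    Hypothetical helper: kappa is the ratio of new to base batch size.
    Learning rate follows the linear scaling rule (lr * kappa);
    EMA momentum follows the EMA scaling rule (momentum ** kappa).
    """
    kappa = new_batch_size / base_batch_size
    return base_lr * kappa, base_momentum ** kappa


def ema_update(ema_params, model_params, momentum):
    """One EMA step: move each EMA parameter towards its target parameter."""
    return [
        momentum * e + (1.0 - momentum) * p
        for e, p in zip(ema_params, model_params)
    ]


# Illustrative values only (assumptions, not taken from the paper).
base_lr, base_momentum = 0.1, 0.99
scaled_lr, scaled_momentum = scale_hyperparameters(
    base_lr, base_momentum, base_batch_size=256, new_batch_size=1024
)

# Sanity check with the target parameters held fixed:
# kappa small-batch EMA steps should match one large-batch EMA step.
ema_start = [0.0, 1.0]
target = [1.0, -1.0]

ema_small = ema_start
for _ in range(4):  # kappa = 1024 / 256 = 4 steps at the base momentum
    ema_small = ema_update(ema_small, target, base_momentum)

ema_large = ema_update(ema_start, target, scaled_momentum)  # one scaled step
```

With the target held fixed, the two EMA trajectories agree exactly, which is the intuition behind exponentiating the momentum. In real training the target model also changes at every step, so the match is approximate rather than exact.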

Videos

How to Scale Your EMA · SlidesLive

Taxonomy

Topics: Neural Networks and Applications · Anomaly Detection Techniques and Applications · Machine Learning and Data Classification

Methods: Bootstrap Your Own Latent