Understanding SGD with Exponential Moving Average: A Case Study in Linear Regression

Xuheng Li; Quanquan Gu

arXiv:2502.14123·cs.LG·February 21, 2025



Open Access

TL;DR

This paper provides a theoretical analysis of exponential moving average (EMA) in stochastic gradient descent (SGD) for high-dimensional linear regression, revealing its variance reduction and bias decay properties.

Contribution

The paper establishes a risk bound for online SGD with EMA in high-dimensional linear regression, showing that EMA reduces the variance error and that the bias error decays exponentially. It also develops proof techniques that apply to a broad class of averaging schemes.

Findings

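1. The variance error of SGD with EMA is always smaller than that of SGD without averaging.

2. Unlike SGD with iterate averaging from the beginning, the bias error of SGD with EMA decays exponentially in every eigen-subspace of the data covariance matrix.

3. The proof techniques apply to the analysis of a broad class of averaging schemes.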
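
For reference, the two averaging schemes being compared can be written as follows. This is a standard textbook form, not taken from the paper: the symbols $w_t$ (SGD iterate), $\bar{w}_t$ (averaged iterate), and $\alpha$ (EMA decay rate) are assumed notation and may differ from the authors' conventions.

```latex
\begin{align*}
\text{EMA:}\quad
  & \bar{w}_t = \alpha\,\bar{w}_{t-1} + (1-\alpha)\,w_t
    = \alpha^{t}\,\bar{w}_0 + (1-\alpha)\sum_{k=1}^{t}\alpha^{t-k}\,w_k,
    \qquad \alpha \in [0,1), \\
\text{Iterate averaging from the beginning:}\quad
  & \bar{w}_t = \frac{1}{t}\sum_{k=0}^{t-1} w_k .
\end{align*}
```

In the unrolled EMA form, the weight on an early iterate $w_k$ shrinks geometrically as $t$ grows, whereas uniform averaging keeps weight $1/t$ on every iterate, including the earliest ones.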

Abstract

Exponential moving average (EMA) has recently gained significant popularity in training modern deep learning models, especially diffusion-based generative models. However, there have been few theoretical results explaining the effectiveness of EMA. In this paper, to better understand EMA, we establish the risk bound of online SGD with EMA for high-dimensional linear regression, one of the simplest overparameterized learning tasks that shares similarities with neural networks. Our results indicate that (i) the variance error of SGD with EMA is always smaller than that of SGD without averaging, and (ii) unlike SGD with iterate averaging from the beginning, the bias error of SGD with EMA decays exponentially in every eigen-subspace of the data covariance matrix. Additionally, we develop proof techniques applicable to the analysis of a broad class of averaging schemes.
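
The setting described in the abstract can be illustrated with a minimal sketch of online SGD with an EMA of the iterates on a synthetic linear regression problem. Everything here is an assumption made for illustration and not the paper's experimental setup: the function names `sgd_with_ema` and `excess_risk`, the identity data covariance, the ground-truth vector, and all hyperparameter values.

```python
import random


def sgd_with_ema(dim=5, steps=2000, lr=0.05, alpha=0.99, noise_std=0.1, seed=0):
    """Online SGD on squared loss, tracking an EMA of the iterates.

    Illustrative assumptions: features are i.i.d. standard Gaussian
    (identity covariance) and the ground truth is the all-ones vector.
    """
    rng = random.Random(seed)
    w_star = [1.0] * dim      # hypothetical ground-truth parameter
    w = [0.0] * dim           # SGD iterate, started at zero
    w_ema = list(w)           # EMA iterate, initialised at the same point

    for _ in range(steps):
        # Draw one fresh sample (online setting: each sample is used once).
        x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        y = sum(a * b for a, b in zip(w_star, x)) + rng.gauss(0.0, noise_std)

        # Stochastic gradient step on 0.5 * (<w, x> - y)^2.
        residual = sum(a * b for a, b in zip(w, x)) - y
        w = [wi - lr * residual * xi for wi, xi in zip(w, x)]

        # EMA update: keep a fraction alpha of the old average.
        w_ema = [alpha * ei + (1.0 - alpha) * wi for ei, wi in zip(w_ema, w)]

    return w_star, w, w_ema


def excess_risk(w, w_star):
    """Excess risk 0.5 * ||w - w_star||^2, valid for identity covariance."""
    return 0.5 * sum((a - b) ** 2 for a, b in zip(w, w_star))


if __name__ == "__main__":
    w_star, w_last, w_avg = sgd_with_ema()
    print("last iterate excess risk:", excess_risk(w_last, w_star))
    print("EMA iterate excess risk: ", excess_risk(w_avg, w_star))
```

Setting `alpha=0` makes the EMA iterate coincide with the last SGD iterate, which gives a simple baseline for comparison. A single run like this one only illustrates the mechanics; the variance and bias claims in the abstract concern expected risk, so a fair empirical check would average over many seeds.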


Taxonomy

Topics: Bayesian Modeling and Causal Inference

Methods: Stochastic Gradient Descent