Understanding SGD with Exponential Moving Average: A Case Study in Linear Regression
Xuheng Li, Quanquan Gu

TL;DR
This paper provides a theoretical analysis of exponential moving average (EMA) in stochastic gradient descent (SGD) for high-dimensional linear regression, revealing its variance reduction and bias decay properties.
Contribution
It establishes the first risk bounds for online SGD with EMA in linear regression, showing variance reduction and exponential bias decay, and introduces new proof techniques for averaging schemes.
Findings
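The variance error of SGD with EMA is always smaller than that of SGD without averaging.
Unlike SGD with iterate averaging from the beginning, the bias error of SGD with EMA decays exponentially in every eigen-subspace of the data covariance matrix.
The proof techniques apply to the analysis of a broad class of averaging schemes.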
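For orientation, the two recursions being compared can be written out as follows. This is the standard textbook form of online SGD on the squared loss with an EMA of the iterates, not notation taken from the paper: the step size \delta, the EMA parameter \alpha, and the choice of which iterate enters the average are assumptions made here for illustration.

```latex
% Online SGD on the squared loss, one fresh sample (x_t, y_t) per step
% (\delta is an assumed symbol for the step size):
w_t = w_{t-1} - \delta \left( \langle w_{t-1}, x_t \rangle - y_t \right) x_t

% EMA of the iterates (\alpha \in [0, 1) is an assumed symbol for the decay):
\bar{w}_t = \alpha \, \bar{w}_{t-1} + (1 - \alpha) \, w_t

% Unrolled: iterate w_s receives weight (1 - \alpha) \alpha^{t - s},
% so older iterates are forgotten geometrically.
\bar{w}_t = \alpha^{t} \, \bar{w}_0 + (1 - \alpha) \sum_{s=1}^{t} \alpha^{t - s} w_s
```

Setting \alpha = 0 recovers the last SGD iterate, which is the "SGD without averaging" baseline in the first finding.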
Abstract
Exponential moving average (EMA) has recently gained significant popularity in training modern deep learning models, especially diffusion-based generative models. However, there have been few theoretical results explaining the effectiveness of EMA. In this paper, to better understand EMA, we establish the risk bound of online SGD with EMA for high-dimensional linear regression, one of the simplest overparameterized learning tasks that shares similarities with neural networks. Our results indicate that (i) the variance error of SGD with EMA is always smaller than that of SGD without averaging, and (ii) unlike SGD with iterate averaging from the beginning, the bias error of SGD with EMA decays exponentially in every eigen-subspace of the data covariance matrix. Additionally, we develop proof techniques applicable to the analysis of a broad class of averaging schemes.
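The setting described in the abstract can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code: the function name `sgd_with_ema`, the zero initialization, the step size, the EMA decay, and the synthetic data with a decaying covariance spectrum are all assumptions chosen for the example.

```python
import numpy as np


def sgd_with_ema(X, y, step_size=0.05, ema_decay=0.99):
    """One pass of online SGD on the squared loss, plus an EMA of the iterates.

    Returns (last_iterate, ema_iterate). Both start at zero (an assumption).
    """
    n, d = X.shape
    w = np.zeros(d)
    w_ema = np.zeros(d)
    for t in range(n):
        x_t, y_t = X[t], y[t]
        # Stochastic gradient of 0.5 * (<w, x_t> - y_t)^2 with respect to w.
        grad = (x_t @ w - y_t) * x_t
        w = w - step_size * grad
        # EMA update: geometric down-weighting of older iterates.
        w_ema = ema_decay * w_ema + (1.0 - ema_decay) * w
    return w, w_ema


def excess_risk(w, w_star, eigenvalues):
    """Excess risk 0.5 * (w - w*)^T H (w - w*) for a diagonal covariance H."""
    diff = w - w_star
    return 0.5 * float(np.sum(eigenvalues * diff**2))


# Hypothetical synthetic problem: diagonal covariance with decaying spectrum.
rng = np.random.default_rng(0)
n, d = 20_000, 20
eigenvalues = 1.0 / np.arange(1, d + 1) ** 2
w_star = np.ones(d)
X = rng.standard_normal((n, d)) * np.sqrt(eigenvalues)
y = X @ w_star + 0.5 * rng.standard_normal(n)  # label noise drives the variance error

w_last, w_ema = sgd_with_ema(X, y, step_size=0.1, ema_decay=0.99)
print("excess risk, last iterate:", excess_risk(w_last, w_star, eigenvalues))
print("excess risk, EMA iterate: ", excess_risk(w_ema, w_star, eigenvalues))
```

With `ema_decay=0.0` the EMA iterate coincides with the last SGD iterate, which gives a convenient baseline for comparing the two on the same sample path.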
Taxonomy
Topics: Bayesian Modeling and Causal Inference
Methods: Stochastic Gradient Descent
