EMA Without the Lag: Bias-Corrected Iterate Averaging Schemes

Adam Block; Cyril Zhang

arXiv:2508.00180·cs.LG·August 4, 2025

Open Access·3 Reviews

TL;DR

The paper introduces BEMA, a bias-corrected exponential moving average method that improves stability and convergence in language model fine-tuning by eliminating bias inherent in traditional EMA.

Contribution

BEMA extends EMA by removing bias, offering provable acceleration and improved performance in language model fine-tuning.

Findings

01

BEMA outperforms standard EMA and vanilla training in convergence speed.

02

BEMA achieves higher final performance on language model benchmarks.

03

In a simple theoretical model, BEMA is shown to provably accelerate over both standard EMA and vanilla training.

Abstract

Stochasticity in language model fine-tuning, often caused by the small batch sizes typically used in this regime, can destabilize training by introducing large oscillations in generation quality. A popular approach to mitigating this instability is to take an Exponential Moving Average (EMA) of weights throughout training. While EMA reduces stochasticity, thereby smoothing training, the introduction of bias from old iterates often creates a lag in optimization relative to vanilla training. In this work, we propose the Bias-Corrected Exponential Moving Average (BEMA), a simple and practical augmentation of EMA that retains variance-reduction benefits while eliminating bias. BEMA is motivated by a simple theoretical model wherein we demonstrate provable acceleration of BEMA over both a standard EMA and vanilla training. Through an extensive suite of experiments on language models, we show…
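
As an illustration of the lag described in the abstract, the following minimal sketch runs SGD on a toy one-dimensional noisy quadratic and tracks a standard EMA of the iterate. The function name `run_sgd_with_ema`, the objective 0.5 * x^2, and every hyperparameter value are assumptions chosen for illustration and are not taken from the paper. The sketch covers plain EMA only; it does not reproduce BEMA's correction term, which this page does not state.

```python
import random


def run_sgd_with_ema(steps=200, lr=0.1, beta=0.9, noise=0.5, x0=10.0, seed=0):
    """Run SGD on f(x) = 0.5 * x**2 with Gaussian gradient noise.

    Returns a list of (iterate, ema) pairs, one per step.
    All names and default values here are illustrative assumptions.
    """
    rng = random.Random(seed)
    x = x0
    ema = x0  # EMA initialised at the starting point
    history = []
    for _ in range(steps):
        # Exact gradient of 0.5 * x**2 is x; add zero-mean noise to mimic
        # the stochasticity of small-batch training.
        grad = x + noise * rng.gauss(0.0, 1.0)
        x = x - lr * grad
        # Standard EMA: a weighted average that still remembers old iterates.
        ema = beta * ema + (1.0 - beta) * x
        history.append((x, ema))
    return history


if __name__ == "__main__":
    # With the noise switched off, the only difference between the iterate
    # and its EMA is the bias carried over from older iterates.
    for step, (x, ema) in enumerate(run_sgd_with_ema(steps=5, noise=0.0), 1):
        print(f"step {step}: iterate={x:.4f}  ema={ema:.4f}  lag={ema - x:.4f}")
```

With `noise=0.0` the iterate decays monotonically toward the optimum at 0, and the EMA stays strictly farther from it at every step because it averages in older, larger iterates. That gap is the lag that the abstract attributes to bias from old iterates.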

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01·Rating 6·Confidence 4

Strengths

Below are some strengths I find in this paper:
- The authors identify an important issue (the "lag" problem) in traditional EMA when fine-tuning large language models (LMs) with small batch sizes. I think this is a relevant and practical concern in deep learning, and resolving it can have a big impact on the quality of downstream applications that result from fine-tuning models.
- Stronger theoretical grounding (based on the Ornstein-Uhlenbeck process in quadratic optimization) that shows provable a

Weaknesses

Despite the strengths, below are some comments on areas for improvement and some limitations I see:
- Despite the strong theoretical backing using the OU process, the main assumption of a noisy quadratic model is a bit too simplistic in my opinion. It is not clear to me whether the findings carry over to the more complex landscapes observed while training real-world deep neural networks and LLMs.
- It is not clear how much hyperparameter tuning went into making this method work. If

Reviewer 02·Rating 4·Confidence 3

Strengths

Strengths:
* The theoretical results are clearly presented and provide a strong foundation for motivating the proposed method.
* The proposed method is simple to implement and achieves better accuracy and convergence speed than multiple baselines.

Weaknesses

Weaknesses & Questions:
* The motivation of the method is somewhat confusing. The paper begins by stating that closed-loop training requires stabilizers and then proposes BEMA. However, in the supervised fine-tuning setup that this paper focuses on, the training process is not a closed-loop scenario. In SFT, the model is trained with a teacher-forcing mechanism, where during the forward pass the next token is predicted based on the ground truth of the previous token. This differs from autoregre

Reviewer 03·Rating 6·Confidence 4

Strengths

1. The paper provides a mathematical analysis of EMA under OU dynamics, formally decomposing the bias–variance tradeoff and offering a closed-form correction term.
2. Results show that BEMA consistently improves stability and early-stage convergence compared to vanilla EMA.

Weaknesses

1. A key limitation of this paper is the lack of comparison with modern adaptive optimizers such as Adam or AdamW. All experiments are conducted under the SGD + EMA setting, which is rarely used in contemporary large language model (LLM) fine-tuning. Since Adam itself maintains exponential moving averages of both the first and second moments of gradients, it already implicitly addresses part of the variance–bias tradeoff that the paper analyzes. Therefore, it remains unclear whether the proposed

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Speech Recognition and Synthesis