Correcting Stochastic Update Bias in Preconditioned Language Model Optimizers
Nikhil Nayak, Julia White, Urchade Zaratiana, Kelton Zhang, Henrijs Princis, Dhruv Atreja, Henry Fawcett, Matthew Thomas, George Hurn-Maloney, Ash Lewis

TL;DR
This paper identifies and corrects finite-sample biases in stochastic preconditioned optimizers for language models, leading to improved training performance.
Contribution
It introduces a bias-correction framework that addresses gradient-preconditioner coupling and nonlinear inversion biases, applicable to popular optimizers like AdamW, Sophia, and Shampoo.
Findings
Bias correction reduces held-out pretraining loss by up to 0.15 nats.
On downstream tasks, the effect of bias correction is neutral to positive.
The framework applies across preconditioning methods, including AdamW, Sophia, and Shampoo, and improves optimizer performance.
Abstract
Preconditioned optimizers are central to language model training, but their stochastic update rules are usually treated as direct approximations to population preconditioned descent. We show that this view misses two finite-sample biases. First, the gradient and preconditioner are typically estimated from the same minibatch, introducing gradient-preconditioner coupling bias. Second, even when the preconditioner estimate is unbiased, its inverse or inverse-root is generally biased because inversion is nonlinear. We propose a single-batch bias-correction framework that addresses both effects: cross-fitted preconditioning estimates the numerator and preconditioner from independent microbatch groups, while variance-corrected inversion uses microbatch variability to subtract the leading delta-method bias term. The framework applies to diagonal moment, diagonal curvature, and matrix…
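
The two corrections described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names, the (K, d) array layout, the even split of microbatches, and the use of squared microbatch gradients as a diagonal-moment preconditioner are all assumptions made for illustration. The correction term is the standard second-order delta-method expansion for f(v) = v^(-1/2).

```python
import numpy as np


def debiased_inverse_sqrt(micro_estimates, eps=1e-12):
    """Return (plain, corrected) estimates of v**-0.5 per coordinate.

    micro_estimates: array of shape (K, d), K >= 2, holding K independent
    unbiased microbatch estimates of a diagonal preconditioner v.
    (Hypothetical layout, chosen for this sketch.)
    """
    k = micro_estimates.shape[0]
    v_hat = np.maximum(micro_estimates.mean(axis=0), eps)
    # Variance of the averaged estimate, from microbatch-to-microbatch spread.
    var_of_mean = micro_estimates.var(axis=0, ddof=1) / k
    plain = v_hat ** -0.5
    # Delta method: E[f(v_hat)] ~ f(v) + 0.5 * f''(v) * Var(v_hat), and for
    # f(v) = v**-0.5, f''(v) = 0.75 * v**-2.5, so the leading bias term is
    # 0.375 * Var(v_hat) * v**-2.5. Subtract a plug-in estimate of it.
    corrected = plain - 0.375 * var_of_mean * v_hat ** -2.5
    return plain, corrected


def cross_fitted_update(micro_grads):
    """Cross-fitted update direction from microbatch gradients of shape (K, d).

    Assumes K is even and K >= 4. The first half of the microbatches supplies
    the numerator; the second half supplies the preconditioner, so the two
    factors are computed from disjoint data.
    """
    half = micro_grads.shape[0] // 2
    numerator = micro_grads[:half].mean(axis=0)
    # Assumed Adam-style diagonal moment: squared microbatch gradients.
    second_moments = micro_grads[half:] ** 2
    _, inv_root = debiased_inverse_sqrt(second_moments)
    return numerator * inv_root


# Toy check of the inversion bias: each column is one independent trial,
# and the true preconditioner value is 1.0 (gamma mean = shape * scale).
rng = np.random.default_rng(0)
micro = rng.gamma(shape=2.0, scale=0.5, size=(8, 100_000))
plain, corrected = debiased_inverse_sqrt(micro)
print("plain bias:    ", plain.mean() - 1.0)
print("corrected bias:", corrected.mean() - 1.0)
```

In this toy setup the plain inverse root overestimates on average, as Jensen's inequality predicts for a convex function of an unbiased estimate, and the variance-corrected version removes most of that gap. The split in `cross_fitted_update` trades some variance in each factor for independence between them.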
