Correcting Stochastic Update Bias in Preconditioned Language Model Optimizers

Nikhil Nayak; Julia White; Urchade Zaratiana; Kelton Zhang; Henrijs Princis; Dhruv Atreja; Henry Fawcett; Matthew Thomas; George Hurn-Maloney; Ash Lewis

arXiv:2605.20756·cs.LG·May 21, 2026

TL;DR

This paper identifies and corrects finite-sample biases in stochastic preconditioned optimizers for language models, leading to improved training performance.

Contribution

It introduces a bias-correction framework that addresses gradient-preconditioner coupling and nonlinear inversion biases, applicable to popular optimizers like AdamW, Sophia, and Shampoo.
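As a rough illustration of the two biases named above, the identities below are standard results written in notation assumed here, not taken from the paper. Let \hat g be a minibatch gradient coordinate, \hat v an unbiased estimate of a scalar preconditioner statistic v > 0, and f the inversion map (for example f(v) = v^{-1/2}).

```latex
% Coupling bias: the covariance term vanishes only when \hat g and \hat v
% are computed from independent data.
\mathbb{E}\big[\hat g\, f(\hat v)\big]
  = \mathbb{E}[\hat g]\;\mathbb{E}\big[f(\hat v)\big]
  + \operatorname{Cov}\big(\hat g,\, f(\hat v)\big)

% Nonlinear inversion bias: second-order (delta-method) expansion around v.
\mathbb{E}\big[f(\hat v)\big]
  \approx f(v) + \tfrac{1}{2}\, f''(v)\, \operatorname{Var}(\hat v)

% Worked case f(v) = v^{-1/2}:  f''(v) = \tfrac{3}{4}\, v^{-5/2}, so
\mathbb{E}\big[\hat v^{-1/2}\big]
  \approx v^{-1/2} + \tfrac{3}{8}\, v^{-5/2}\, \operatorname{Var}(\hat v)
```

Because f is convex in this case, the plain inverse root overestimates v^{-1/2} on average, even though \hat v itself is unbiased.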

Findings

01. Bias correction reduces held-out pretraining loss by up to 0.15 nats.

02. On downstream tasks, the effect of bias correction is neutral to positive.

03. The framework applies to several preconditioning methods, including optimizers such as AdamW, Sophia, and Shampoo, and improves optimizer performance.

Abstract

Preconditioned optimizers are central to language model training, but their stochastic update rules are usually treated as direct approximations to population preconditioned descent. We show that this view misses two finite-sample biases. First, the gradient and preconditioner are typically estimated from the same minibatch, introducing gradient-preconditioner coupling bias. Second, even when the preconditioner estimate is unbiased, its inverse or inverse-root is generally biased because inversion is nonlinear. We propose a single-batch bias-correction framework that addresses both effects: cross-fitted preconditioning estimates the numerator and preconditioner from independent microbatch groups, while variance-corrected inversion uses microbatch variability to subtract the leading delta-method bias term. The framework applies to diagonal moment, diagonal curvature, and matrix…
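The two corrections described in the abstract can be sketched for a diagonal, Adam-style inverse-root preconditioner. This is a minimal illustration under stated assumptions, not the paper's algorithm: the function names `inv_sqrt_corrected` and `cross_fitted_update` are hypothetical, the microbatches are split into two equal halves with the roles swapped and averaged, and momentum, exponential moving averages, and weight decay are left out.

```python
import numpy as np

def inv_sqrt_corrected(stats, eps=1e-12):
    """Variance-corrected estimate of v**-0.5 (hypothetical helper).

    stats: array of shape (k, d), k >= 2, holding per-microbatch
    unbiased estimates of a positive statistic v.
    """
    k = stats.shape[0]
    v_hat = stats.mean(axis=0) + eps
    # Variance of the mean, estimated from the spread across microbatches.
    var_hat = stats.var(axis=0, ddof=1) / k
    plain = v_hat ** -0.5
    # Delta method: E[f(v_hat)] ~ f(v) + 0.5 * f''(v) * Var(v_hat),
    # with f(v) = v**-0.5 and f''(v) = 0.75 * v**-2.5.
    return plain - 0.375 * v_hat ** -2.5 * var_hat

def cross_fitted_update(grads):
    """Cross-fitted preconditioned direction (hypothetical helper).

    grads: array of shape (m, d) of microbatch gradients, m >= 4.
    The numerator and the preconditioner never share a microbatch.
    """
    half = grads.shape[0] // 2
    a, b = grads[:half], grads[half:]
    # Gradient from one group, second-moment preconditioner from the other.
    upd_ab = a.mean(axis=0) * inv_sqrt_corrected(b ** 2)
    upd_ba = b.mean(axis=0) * inv_sqrt_corrected(a ** 2)
    # Swap roles and average so every microbatch contributes to both parts.
    return 0.5 * (upd_ab + upd_ba)

rng = np.random.default_rng(0)
microbatch_grads = rng.normal(loc=0.1, scale=1.0, size=(8, 4))
direction = cross_fitted_update(microbatch_grads)
```

Each group needs at least two microbatches so that the sample variance is defined. In this sketch the subtraction can drive the scale negative when the microbatch estimates are very noisy, so a practical version would presumably clip or damp the correction.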
