Understanding Quantization of Optimizer States in LLM Pre-training: Dynamics of State Staleness and Effectiveness of State Resets
Kristi Topollai, Anna Choromanska

TL;DR
This paper investigates how quantization affects optimizer states in large language model pre-training, revealing state staleness issues and proposing reset strategies to improve efficiency and performance.
Contribution
It introduces a predictive model of optimizer state stalling due to quantization and develops a theory-guided method for optimal reset scheduling in low-precision training.
Findings
Reset schedules can recover performance lost to low-precision optimizer states.
Quantization causes optimizer states to become effectively stale, slowing adaptation.
With well-timed resets, low-precision optimizer states keep their memory savings while maintaining model performance.
Abstract
Quantizing optimizer states is becoming an important ingredient of memory-efficient large-scale pre-training, but the resulting optimizer dynamics remain only partially understood. We study low-precision exponential moving average (EMA) optimizer states and show how quantization can cause many nominal updates to round back to the same stored value, making the state effectively stale and slowing adaptation beyond what the nominal decay would suggest. We then develop a simple predictive model of stalling that estimates one-step stalling probabilities and characterizes how stalling builds up over time after the initialization. This perspective provides a mechanistic explanation for why optimizer-state resets help in low precision: once a quantized EMA becomes effectively stale, resetting it can temporarily restore responsiveness. Motivated by this picture, we derive a simple theory-guided…
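The stalling mechanism described in the abstract can be illustrated with a small sketch. It is not the paper's model or code: the uniform round-to-nearest grid, the step size of 0.01, the decay of 0.99, the constant gradient of 1.0, and the helper names `quantize` and `run_ema` are all assumptions chosen for illustration. Real low-precision formats have non-uniform spacing, but the effect is the same in kind: once the nominal EMA update is smaller than half the grid spacing, the stored value rounds back to itself.

```python
import numpy as np


def quantize(x, step):
    """Round x to the nearest multiple of `step` (hypothetical uniform grid)."""
    return np.round(x / step) * step


def run_ema(grads, beta, step=None):
    """Run an EMA over `grads`, optionally storing the state on a quantized grid.

    Returns the final state and the fraction of steps on which the stored
    value did not change (a "stalled" update).
    """
    m = 0.0
    stalled = 0
    for g in grads:
        new_m = beta * m + (1.0 - beta) * g  # nominal EMA update
        if step is not None:
            new_m = float(quantize(new_m, step))  # store in low precision
        if new_m == m:
            stalled += 1  # the update rounded back to the same stored value
        m = new_m
    return m, stalled / len(grads)


# Constant gradient signal: the full-precision EMA should converge toward 1.0.
grads = np.ones(2000)
full_m, full_stall = run_ema(grads, beta=0.99)
quant_m, quant_stall = run_ema(grads, beta=0.99, step=0.01)

print(f"full precision: state={full_m:.4f}, stalled fraction={full_stall:.3f}")
print(f"quantized:      state={quant_m:.4f}, stalled fraction={quant_stall:.3f}")
```

In this toy setting the quantized state stops moving roughly halfway to the target, because the remaining nominal update, (1 - beta) times the gap, falls below half the grid step. The full-precision state keeps adapting, which is the gap between nominal and effective decay that the abstract refers to.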
Taxonomy
Topics: Parallel Computing and Optimization Techniques · Advanced Memory and Neural Computing · Neural Networks and Reservoir Computing
