Characterization and Mitigation of Training Instabilities in Microscaling Formats
Huangyuan Su, Mujin Kwun, Stephanie Gil, Sham Kakade, Nikhil Anand

TL;DR
This paper investigates the training instabilities caused by Microscaling (MX) low-precision formats in large language models, revealing their causes and proposing stabilization strategies to enable efficient training at scale.
Contribution
It identifies the causes of stochastic instabilities in MX formats during training and proposes mitigation techniques to stabilize training and maintain performance.
Findings
Training in MX formats causes sharp, stochastic loss instabilities.
Gradient bias from quantization can trigger runaway divergence.
Hybrid precision schemes can stabilize training and match full-precision performance.
Abstract
Training large language models is an expensive, compute-bound process that must be repeated as models scale, algorithms improve, and new data is collected. To address this, next-generation hardware accelerators increasingly support lower-precision arithmetic formats, such as the Microscaling (MX) formats introduced in NVIDIA's Blackwell architecture. These formats use a shared scale within blocks of parameters to extend representable range and perform forward/backward GEMM operations in reduced precision for efficiency gains. In this work, we investigate the challenges and viability of block-scaled precision formats during model training. Across nearly one thousand language models trained from scratch -- spanning compute budgets from to FLOPs and sweeping over a broad range of weight-activation precision combinations -- we consistently observe…
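The shared-scale idea described in the abstract can be illustrated with a minimal pure-Python sketch. This is not the paper's code. The block size of 32, the 4-bit E2M1 element grid, and the function names `quantize_block` and `quantize_mx` are assumptions chosen for illustration. Ties are broken toward the smaller magnitude, which is simpler than the rounding a real implementation would likely use.

```python
import math

# Magnitudes representable by a 4-bit E2M1 element (sign is handled separately).
# Assumption: this grid stands in for whichever element format is being studied.
FP4_E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
ELEM_EMAX = 2  # exponent of the largest grid magnitude: 6 = 1.5 * 2**2


def quantize_block(block):
    """Quantize one block; return (shared_scale, dequantized_values)."""
    amax = max(abs(v) for v in block)
    if amax == 0.0:
        return 1.0, [0.0] * len(block)
    # One power-of-two scale is shared by every element in the block,
    # chosen so the largest element lands near the top of the grid.
    shared_exp = math.floor(math.log2(amax)) - ELEM_EMAX
    scale = 2.0 ** shared_exp
    out = []
    for v in block:
        # Saturate magnitudes that exceed the largest grid value.
        mag = min(abs(v) / scale, FP4_E2M1_GRID[-1])
        # Nearest grid point; ties go to the smaller magnitude (simplification).
        nearest = min(FP4_E2M1_GRID, key=lambda g: abs(g - mag))
        out.append(math.copysign(nearest, v) * scale)
    return scale, out


def quantize_mx(values, block_size=32):
    """Hypothetical helper: quantize a flat list block by block."""
    result = []
    for i in range(0, len(values), block_size):
        _, dequantized = quantize_block(values[i:i + block_size])
        result.extend(dequantized)
    return result
```

Because the scale follows the largest element in each block, the same small value can be flushed to zero in a block with one large entry and kept approximately in a block whose entries are all small. For example, `quantize_mx([6.0, 0.1, 0.75, 0.1], block_size=2)` maps the first `0.1` to `0.0` and the second to `0.125`.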
Taxonomy
Topics: Rheology and Fluid Dynamics Studies
