Characterization and Mitigation of Training Instabilities in Microscaling Formats

Huangyuan Su; Mujin Kwun; Stephanie Gil; Sham Kakade; Nikhil Anand

arXiv:2506.20752·cs.LG·June 27, 2025

TL;DR

This paper investigates the training instabilities caused by Microscaling (MX) low-precision formats in large language models, revealing their causes and proposing stabilization strategies to enable efficient training at scale.

Contribution

It identifies the causes of stochastic instabilities in MX formats during training and proposes mitigation techniques to stabilize training and maintain performance.

Findings

01

Training in MX formats causes sharp, stochastic loss instabilities.

02

Gradient bias from quantization can trigger runaway divergence.

03

Hybrid precision schemes can stabilize training and match full-precision performance.

Abstract

Training large language models is an expensive, compute-bound process that must be repeated as models scale, algorithms improve, and new data is collected. To address this, next-generation hardware accelerators increasingly support lower-precision arithmetic formats, such as the Microscaling (MX) formats introduced in NVIDIA's Blackwell architecture. These formats use a shared scale within blocks of parameters to extend representable range and perform forward/backward GEMM operations in reduced precision for efficiency gains. In this work, we investigate the challenges and viability of block-scaled precision formats during model training. Across nearly one thousand language models trained from scratch -- spanning compute budgets from $2 \times 10^{17}$ to $4.8 \times 10^{19}$ FLOPs and sweeping over a broad range of weight-activation precision combinations -- we consistently observe…
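The "shared scale within blocks" idea mentioned in the abstract can be illustrated with a minimal sketch. This is not the paper's code and not the exact MX element encoding: it assumes a power-of-two scale shared by each block and, for simplicity, signed-integer elements instead of the FP8/FP6/FP4 element types. The function names `quantize_block`, `dequantize_block`, and `fake_quantize`, and the defaults `block_size=32` and `elem_bits=8`, are hypothetical choices made for illustration.

```python
import math


def quantize_block(block, elem_bits=8):
    """Quantize one block to integer elements plus a shared power-of-two exponent.

    Illustrative assumption: elements are signed integers in [-qmax, qmax],
    not the floating-point element types used by real MX formats.
    """
    qmax = 2 ** (elem_bits - 1) - 1  # largest element magnitude, e.g. 127
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return 0, [0] * len(block)
    # Smallest exponent e such that amax / 2**e fits within qmax.
    shared_exp = math.ceil(math.log2(amax / qmax))
    scale = 2.0 ** shared_exp
    # Round to the nearest integer and clamp as a guard against float edge cases.
    elems = [max(-qmax, min(qmax, round(x / scale))) for x in block]
    return shared_exp, elems


def dequantize_block(shared_exp, elems):
    """Map integer elements back to floats using the shared scale."""
    scale = 2.0 ** shared_exp
    return [q * scale for q in elems]


def fake_quantize(values, block_size=32, elem_bits=8):
    """Quantize then dequantize a flat list, one block at a time."""
    out = []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        shared_exp, elems = quantize_block(block, elem_bits)
        out.extend(dequantize_block(shared_exp, elems))
    return out
```

Under these assumptions, the scale is set by the largest magnitude in a block, so a single outlier coarsens the grid for every other element in that block. For example, `fake_quantize([100.0, 0.01], block_size=2)` rounds the small entry to zero, whereas placing the two values in separate blocks (`block_size=1`) keeps both close to their original values.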

Code & Models

Repositories

hither1/systems-scaling (JAX, official)

Taxonomy

Topics: Rheology and Fluid Dynamics Studies