Training LLMs with MXFP4
Albert Tseng, Tao Yu, Youngsuk Park

TL;DR
This paper introduces a near-lossless training method using MXFP4 low-precision data types with stochastic rounding and Hadamard transforms, enabling faster large language model training with minimal quality loss.
Contribution
The authors develop a novel training recipe that employs MXFP4 GEMMs with unbiased gradient estimates, achieving significant speedups while maintaining model quality.
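To make the "unbiased gradient estimates" idea concrete, here is a minimal sketch of stochastic rounding onto the FP4 (E2M1) value grid that MXFP4 uses for its elements. It is an illustration, not the authors' implementation: the names FP4_GRID, stochastic_round, and nearest_round are hypothetical, the shared per-block scale of MXFP4 is omitted, and values outside [-6, 6] are simply clamped (which reintroduces bias for those values).

```python
import bisect
import random

# Values representable by a signed FP4 (E2M1) element, in ascending order.
FP4_GRID = [-6.0, -4.0, -3.0, -2.0, -1.5, -1.0, -0.5, 0.0,
            0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def nearest_round(x):
    """Deterministic round-to-nearest grid point (biased for a fixed x)."""
    return min(FP4_GRID, key=lambda g: abs(g - x))


def stochastic_round(x, rng):
    """Round x to one of its two neighboring grid points at random.

    The probability of rounding up is proportional to the distance from
    the lower neighbor, so the expected result equals x when |x| <= 6.
    """
    x = max(FP4_GRID[0], min(FP4_GRID[-1], x))  # clamp to representable range
    hi_idx = bisect.bisect_left(FP4_GRID, x)
    hi = FP4_GRID[hi_idx]
    if hi == x:  # already representable
        return x
    lo = FP4_GRID[hi_idx - 1]
    p_up = (x - lo) / (hi - lo)
    return hi if rng.random() < p_up else lo


if __name__ == "__main__":
    rng = random.Random(0)
    x = 2.3
    samples = [stochastic_round(x, rng) for _ in range(20000)]
    print("nearest:", nearest_round(x))
    print("mean of stochastic roundings:", sum(samples) / len(samples))
```

Round-to-nearest always maps 2.3 to 2.0, while the average of many stochastic roundings approaches 2.3. The price is variance: each individual sample is either 2.0 or 3.0.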
Findings
MXFP4 GEMMs are 2x faster than FP8 on supported hardware.
The method achieves an estimated speedup of >1.3x over FP8 and >1.7x over BF16 during backpropagation.
Training GPT models of up to 6.7B parameters shows minimal quality degradation relative to mixed-precision BF16 training.
Abstract
Low precision (LP) datatypes such as MXFP4 can accelerate matrix multiplications (GEMMs) and reduce training costs. However, directly using MXFP4 instead of BF16 during training significantly degrades model quality. In this work, we present the first near-lossless training recipe that uses MXFP4 GEMMs, which are faster than FP8 on supported hardware. Our key insight is to compute unbiased gradient estimates with stochastic rounding (SR), resulting in more accurate model updates. However, directly applying SR to MXFP4 can result in high variance from block-level outliers, harming convergence. To overcome this, we use the random Hadamard transform to theoretically bound the variance of SR. We train GPT models up to 6.7B parameters and find that our method induces minimal degradation over mixed-precision BF16 training. Our recipe computes the training FLOPs in MXFP4,…
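The role of the random Hadamard transform can be sketched in a few lines. The example below is an assumption-laden illustration rather than the paper's kernel: fwht and random_hadamard are hypothetical names, the block size of 32 is chosen to match the MXFP4 block size, and everything runs in plain floating point with no quantization. It shows two properties the abstract relies on: the transform spreads a single block-level outlier across the whole block, and because it is orthogonal, applying it to both operands leaves their dot product (and hence a GEMM result) unchanged.

```python
import math
import random


def fwht(v):
    """Orthonormal fast Walsh-Hadamard transform; len(v) must be a power of two."""
    v = list(v)
    n = len(v)
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a, b = v[j], v[j + h]
                v[j], v[j + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)  # makes the transform orthogonal
    return [x * scale for x in v]


def random_hadamard(v, signs):
    """Apply a random sign flip per element, then the Hadamard transform."""
    return fwht([s * x for s, x in zip(signs, v)])


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


if __name__ == "__main__":
    rng = random.Random(0)
    n = 32  # assumed block size for this illustration
    signs = [rng.choice((-1.0, 1.0)) for _ in range(n)]

    a = [0.1] * n
    a[5] = 10.0  # a single outlier dominating the block
    b = [rng.uniform(-1.0, 1.0) for _ in range(n)]

    ta, tb = random_hadamard(a, signs), random_hadamard(b, signs)
    print("max |a| before:", max(abs(x) for x in a))
    print("max |a| after: ", max(abs(x) for x in ta))
    print("dot before:", dot(a, b))
    print("dot after: ", dot(ta, tb))
```

With a smaller maximum magnitude per block, the shared block scale can be smaller, so the rounding grid is finer and stochastic rounding introduces less variance.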
