Pretraining large language models with MXFP4 on Native FP4 Hardware
Musa Cim, Poovaiah Palangappa, Miro Hodak, Ravi Dwivedula, Meena Arunachalam, Mahmut Taylan Kandemir

TL;DR
This paper investigates why full-pipeline FP4 training of large language models often diverges, identifying weight gradient quantization as the main cause and demonstrating that deterministic Hadamard rotations can restore stability.
Contribution
It provides a controlled study of MXFP4 quantization effects in transformer training, highlighting the critical role of structured micro-scaling errors along gradient paths.
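To make "micro-scaling" concrete, here is a minimal NumPy sketch of MXFP4-style fake quantization. It assumes the usual MX layout: blocks of 32 elements share one power-of-two scale, and each element is an FP4 E2M1 value. The function name `mxfp4_quantize`, the nearest-value rounding, and the tie-breaking are illustrative assumptions, not the authors' implementation or a hardware-exact model.

```python
import numpy as np

# Magnitudes representable by an FP4 E2M1 element (sign handled separately).
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
BLOCK = 32      # elements sharing one scale
EMAX_ELEM = 2   # exponent of the largest E2M1 value (6 = 1.5 * 2**2)

def mxfp4_quantize(x):
    """Fake-quantize a 1-D array whose length is a multiple of BLOCK.

    Hypothetical helper for illustration only.
    """
    blocks = np.asarray(x, dtype=np.float64).reshape(-1, BLOCK)
    amax = np.abs(blocks).max(axis=1, keepdims=True)
    # All-zero blocks use a placeholder amax of 1 so the exponent is defined.
    safe = np.where(amax > 0, amax, 1.0)
    # frexp gives safe = m * 2**e with 0.5 <= m < 1, so floor(log2(safe)) == e - 1.
    _, e = np.frexp(safe)
    scale = np.ldexp(1.0, e - 1 - EMAX_ELEM)   # power-of-two shared scale
    scaled = blocks / scale
    # Saturate at 6, then snap each magnitude to the closest grid point
    # (ties go to the smaller magnitude here; real hardware may differ).
    mag = np.minimum(np.abs(scaled), FP4_GRID[-1])
    idx = np.abs(mag[..., None] - FP4_GRID).argmin(axis=-1)
    q = np.sign(scaled) * FP4_GRID[idx]
    return (q * scale).reshape(np.shape(x))

x = np.random.default_rng(0).standard_normal(64)
xq = mxfp4_quantize(x)
print("max abs error:", np.abs(x - xq).max())
```

Because the scale is shared per block, a single large element coarsens the grid for its 31 neighbours, which is one way the error becomes structured rather than independent noise.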
Findings
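Quantizing weight gradients (Wgrad) is the primary driver of convergence degradation in FP4 training; FP4 in Fprop and Dgrad alone adds only modest token requirements.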
Deterministic Hadamard rotations restore training stability.
Stochastic rounding and randomized Hadamard rotations fail to stabilize training once Wgrad is quantized.
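The rotations named in the findings can be sketched as follows. This example assumes the rotation is applied along the token (contraction) dimension of the Wgrad GEMM, and that "randomized" means a Hadamard matrix with random sign flips; both are assumptions, since this summary does not say where or how the paper applies the rotation. The helpers `hadamard` and `randomized_hadamard` are hypothetical names.

```python
import numpy as np

def hadamard(n):
    """Orthonormal Sylvester Hadamard matrix; n must be a power of two."""
    assert n > 0 and n & (n - 1) == 0, "n must be a power of two"
    h = np.ones((1, 1))
    while h.shape[0] < n:
        h = np.block([[h, h], [h, -h]])
    return h / np.sqrt(n)

def randomized_hadamard(n, rng):
    """Hadamard matrix with random column sign flips (still orthogonal)."""
    signs = rng.choice([-1.0, 1.0], size=n)
    return hadamard(n) * signs   # broadcasting scales column j by signs[j]

rng = np.random.default_rng(0)
n, d_in, d_out = 32, 8, 4
x = rng.standard_normal((n, d_in))    # activations, tokens x d_in
g = rng.standard_normal((n, d_out))   # output gradients, tokens x d_out

h = hadamard(n)
wgrad_ref = x.T @ g
wgrad_rot = (h @ x).T @ (h @ g)       # rotation cancels because H^T H = I

# A single outlier is spread evenly across the block after rotation.
spike = np.zeros(n)
spike[0] = 8.0
print("rotated spike magnitudes:", np.unique(np.round(np.abs(h @ spike), 6)))
print("max Wgrad difference:", np.abs(wgrad_ref - wgrad_rot).max())
```

In exact arithmetic the rotated and unrotated products agree. In a quantized pipeline the operands would be quantized after rotation, so the identity holds only approximately, and the benefit comes from the rotated operands being easier to represent with a shared block scale.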
Abstract
Why does full-pipeline FP4 training of large language models often diverge, even when forward activations and activation gradients remain stable? We address this question through a controlled study of MXFP4 quantization in transformer training, progressively enabling FP4 across forward propagation (Fprop), activation gradients (Dgrad), and weight gradients (Wgrad) while holding all other factors fixed. In full pretraining of Llama 3.1-8B on the C4 dataset, we observe that quantizing Wgrad is the primary driver of convergence degradation, whereas FP4 in Fprop and Dgrad alone introduces only modest additional token requirements. To interpret this behavior, we evaluate both structured and stochastic interventions under a controlled experimental setting. We find that stochastic rounding and randomized Hadamard rotations fail to stabilize training once Wgrad is quantized, whereas…
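The progressive-enabling protocol described in the abstract can be illustrated on a single linear layer, where Fprop, Dgrad, and Wgrad are three separate GEMMs. The sketch below is an assumption-laden toy: `fake_mxfp4` and `linear_step` are hypothetical helpers, quantization is simulated in float64 along each GEMM's contraction axis, and nothing here reproduces the paper's training recipe or results.

```python
import numpy as np

FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def fake_mxfp4(x, axis, block=32):
    """Fake-quantize x in blocks of `block` along `axis` (hypothetical helper)."""
    moved = np.moveaxis(x, axis, -1)
    blocks = moved.reshape(-1, block)
    amax = np.abs(blocks).max(axis=1, keepdims=True)
    # frexp: amax = m * 2**e with 0.5 <= m < 1, so the shared scale is 2**(e - 3).
    _, e = np.frexp(np.where(amax > 0, amax, 1.0))
    scale = np.ldexp(1.0, e - 3)
    mag = np.minimum(np.abs(blocks) / scale, 6.0)
    q = np.sign(blocks) * FP4_GRID[np.abs(mag[..., None] - FP4_GRID).argmin(-1)]
    return np.moveaxis((q * scale).reshape(moved.shape), -1, axis)

def linear_step(x, w, dy, fp4_fprop=False, fp4_dgrad=False, fp4_wgrad=False):
    """Return (y, dx, dw) for y = x @ w, with FP4 toggled per GEMM."""
    ident = lambda t, axis: t
    qf = fake_mxfp4 if fp4_fprop else ident
    qd = fake_mxfp4 if fp4_dgrad else ident
    qw = fake_mxfp4 if fp4_wgrad else ident
    y = qf(x, 1) @ qf(w, 0)         # Fprop: contract over d_in
    dx = qd(dy, 1) @ qd(w, 1).T     # Dgrad: contract over d_out
    dw = qw(x, 0).T @ qw(dy, 0)     # Wgrad: contract over tokens
    return y, dx, dw

rng = np.random.default_rng(0)
tokens, d_in, d_out = 64, 32, 32
x = rng.standard_normal((tokens, d_in))
w = rng.standard_normal((d_in, d_out))
dy = rng.standard_normal((tokens, d_out))

ref = linear_step(x, w, dy)
configs = [("fprop", dict(fp4_fprop=True)),
           ("fprop+dgrad", dict(fp4_fprop=True, fp4_dgrad=True)),
           ("fprop+dgrad+wgrad", dict(fp4_fprop=True, fp4_dgrad=True, fp4_wgrad=True))]
for name, flags in configs:
    out = linear_step(x, w, dy, **flags)
    errs = [np.linalg.norm(o - r) / np.linalg.norm(r) for o, r in zip(out, ref)]
    print(name, "relative error (y, dx, dw):", np.round(errs, 3))
```

Keeping the three flags independent mirrors the controlled design: each configuration changes exactly one more GEMM while the inputs stay fixed, so any change in the `dw` error can be attributed to Wgrad quantization alone.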
