Recipes for Pre-training LLMs with MXFP8
Asit Mishra, Dusan Stosic, Simon Layton, Paulius Micikevicius

TL;DR
This paper explores the use of MXFP8 data formats for efficient pre-training of large language models, demonstrating that with proper parameter choices, models up to 8B parameters can be trained effectively on large datasets.
Contribution
It reviews the parameter choices that MX formats require and shows that the MXFP8-E4M3 datatype, combined with a specific number conversion algorithm, yields training runs that match those carried out in BF16 while using fewer bits.
Findings
MXFP8 enables quantization of more tensors during training.
Models up to 8B parameters trained with MXFP8 match BF16 performance.
Efficient training on datasets of up to 15T tokens.
Abstract
Using fewer bits to represent model parameters and related tensors during pre-training has become a required technique for improving GPU efficiency without sacrificing accuracy. Microscaling (MX) formats, introduced in the NVIDIA Blackwell generation of GPUs, represent a major advancement of this technique, making it practical to combine narrow floating-point data types with finer-granularity per-block scaling factors. In turn, this enables both quantization of more tensors than previous approaches and more efficient execution of operations on those tensors. Effective use of MX formats requires careful choices of various parameters. In this paper we review these choices and show how the MXFP8-E4M3 datatype and a specific number conversion algorithm result in training sessions that match those carried out in BF16. We present results using models with up to 8B parameters, trained on high-quality…
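To make the idea of per-block scaling concrete, here is a minimal NumPy sketch of quantizing one block to an MXFP8-E4M3-like representation. The abstract does not spell out the conversion algorithm, so everything here is an illustrative assumption rather than the authors' method: the block size of 32, the power-of-two shared scale, the choice to round the scale exponent up so that no element saturates, and the helper names round_to_e4m3, quantize_block, and dequantize_block. E4M3 rounding is simulated in float64 instead of using a hardware type.

```python
import numpy as np

E4M3_MAX = 448.0   # largest finite E4M3 magnitude
BLOCK_SIZE = 32    # assumed number of elements sharing one scale factor


def round_to_e4m3(x):
    """Simulate rounding to the E4M3 grid (round-to-nearest-even, saturating)."""
    x = np.clip(np.asarray(x, dtype=np.float64), -E4M3_MAX, E4M3_MAX)
    _, e = np.frexp(np.abs(x))        # |x| = m * 2**e with m in [0.5, 1)
    e = np.maximum(e - 1, -6)         # floor(log2|x|), clamped to the min normal exponent
    step = np.ldexp(1.0, e - 3)       # grid spacing with 3 mantissa bits
    return np.round(x / step) * step  # np.round rounds ties to even


def quantize_block(block):
    """Return (scale_exp, q): a shared power-of-two exponent and E4M3-grid values."""
    block = np.asarray(block, dtype=np.float64)
    amax = np.max(np.abs(block))
    if amax == 0.0:
        return 0, np.zeros_like(block)
    # Assumption: round the exponent up so amax / 2**scale_exp <= E4M3_MAX.
    scale_exp = int(np.ceil(np.log2(amax / E4M3_MAX)))
    scale_exp = max(-127, min(127, scale_exp))  # range of an 8-bit exponent-only scale
    return scale_exp, round_to_e4m3(block / 2.0 ** scale_exp)


def dequantize_block(scale_exp, q):
    """Recover approximate values from the shared exponent and quantized elements."""
    return q * 2.0 ** scale_exp


block = np.linspace(-1.0, 1.0, BLOCK_SIZE)
scale_exp, q = quantize_block(block)
restored = dequantize_block(scale_exp, q)
max_rel_err = np.max(np.abs(restored - block) / np.abs(block))
print(scale_exp, max_rel_err)
```

Because each block carries its own scale, a block of small values is not forced onto the coarse end of the grid by a large value elsewhere in the tensor, which is the practical benefit of finer-granularity scaling described above.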
Taxonomy
Topics: Parallel Computing and Optimization Techniques · Advanced Data Storage Technologies · Tensor decomposition and applications
