Practical FP4 Training for Large-Scale MoE Models on Hopper GPUs
Wuyue Zhang, Chongdong Huang, Chunbo You, Cheng Gu, Fengjuan Wang, Mou Sun

TL;DR
This paper introduces a practical FP4 training method for large-scale MoE models on Hopper GPUs, which lack native FP4 hardware support. The method reduces memory and bandwidth usage while maintaining model performance.
Contribution
The authors develop a novel FP8-to-FP4 quantization approach enabling efficient MoE training on Hopper GPUs without native FP4 support.
Findings
Achieves a 14.8% reduction in peak activation memory
Improves training throughput by 12.5%
Maintains comparable performance to FP8 baselines at 671B parameters
Abstract
Training large-scale Mixture-of-Experts (MoE) models is bottlenecked by activation memory and expert-parallel communication, yet FP4 training remains impractical on Hopper-class GPUs without native MXFP4 or NVFP4 support. In this work, we present a training recipe that enables MXFP4 efficiency for MoE models on Hopper architectures without native 4-bit computation support. A central challenge is to integrate FP4 into an existing BF16/FP8 hybrid training pipeline without incurring costly precision round-trips (e.g., FP4 ↔ BF16 ↔ FP8). We address this challenge by introducing direct FP8-to-FP4 quantization and de-quantization, together with scaling-aware FP4 row-wise to column-wise conversion, enabling FP4 activations and expert-parallel communication with minimal overhead. Core MoE computations are executed in FP8, while activations and expert-parallel…
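To make the MXFP4 format mentioned in the abstract concrete, here is a minimal NumPy sketch of block-scaled 4-bit quantization: each group of 32 values shares one power-of-two scale, and each element is rounded to the nearest E2M1 magnitude (0, 0.5, 1, 1.5, 2, 3, 4, 6). This is an illustration only and not the authors' method: it simulates the format in float32 rather than converting directly from FP8, and the function names `mxfp4_quantize` and `mxfp4_dequantize`, the scale rule, and the tie-breaking behavior are assumptions made for the example.

```python
import numpy as np

# Representable magnitudes of the 4-bit E2M1 element format.
E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0], dtype=np.float32)
BLOCK = 32  # MXFP4 shares one scale across 32 consecutive elements


def mxfp4_quantize(x):
    """Simulate row-wise MXFP4 quantization (hypothetical helper).

    Returns (codes, negative, exponents): a 3-bit magnitude index and a sign
    flag per element, plus one power-of-two exponent per 32-element block.
    """
    x = np.asarray(x, dtype=np.float32)
    rows, cols = x.shape
    assert cols % BLOCK == 0, "row length must be a multiple of 32"
    blocks = x.reshape(rows, cols // BLOCK, BLOCK)

    amax = np.abs(blocks).max(axis=-1, keepdims=True)
    safe = np.where(amax > 0, amax, 1.0)  # avoid log2(0) for all-zero blocks
    # Assumed scale rule: align the block maximum with E2M1's top exponent (2).
    exponents = np.clip(np.floor(np.log2(safe)) - 2, -127, 127)
    scaled = np.abs(blocks) / np.exp2(exponents)

    # Nearest grid point; ties go to the smaller magnitude, and scaled values
    # above 6 saturate at 6.
    codes = np.abs(scaled[..., None] - E2M1_GRID).argmin(axis=-1).astype(np.uint8)
    negative = blocks < 0
    return codes, negative, exponents.astype(np.int16)


def mxfp4_dequantize(codes, negative, exponents):
    """Reconstruct float32 values from the simulated MXFP4 representation."""
    magnitude = E2M1_GRID[codes] * np.exp2(exponents.astype(np.float32))
    blocks = np.where(negative, -magnitude, magnitude)
    rows, nblocks, _ = blocks.shape
    return blocks.reshape(rows, nblocks * BLOCK).astype(np.float32)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    activations = rng.standard_normal((4, 64)).astype(np.float32)
    restored = mxfp4_dequantize(*mxfp4_quantize(activations))
    print("max abs error:", float(np.abs(activations - restored).max()))
```

In this sketch the scales are attached to blocks along each row, so simply transposing the codes would leave the scales grouped along the wrong axis. That is one plausible reason a row-wise to column-wise conversion has to account for scaling, although the abstract does not spell out how the authors handle it.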
Taxonomy
Topics: Advanced Neural Network Applications · Generative Adversarial Networks and Image Synthesis · Stochastic Gradient Optimization Techniques
