Practical FP4 Training for Large-Scale MoE Models on Hopper GPUs

Wuyue Zhang; Chongdong Huang; Chunbo You; Cheng Gu; Fengjuan Wang; Mou Sun

arXiv:2603.02731·cs.LG·March 4, 2026

Open Access

TL;DR

This paper introduces a practical FP4 training method for large-scale MoE models on Hopper GPUs, which lack native FP4 hardware support. The method reduces activation memory and expert-parallel communication bandwidth while keeping model performance comparable to FP8 baselines.

Contribution

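The authors introduce direct FP8-to-FP4 quantization and de-quantization, together with scaling-aware FP4 row-wise to column-wise conversion, enabling FP4 activations and expert-parallel communication for MoE training on Hopper GPUs without native FP4 support.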

Findings

01. Achieves 14.8% reduction in peak activation memory

02. Improves training throughput by 12.5%

03. Maintains comparable performance to FP8 baselines at 671B parameters

Abstract

Training large-scale Mixture-of-Experts (MoE) models is bottlenecked by activation memory and expert-parallel communication, yet FP4 training remains impractical on Hopper-class GPUs without native MXFP4 or NVFP4 support. In this work, we present a training recipe that enables MXFP4 efficiency for MoE models on Hopper architectures without native 4-bit computation support. A central challenge is to integrate FP4 into an existing BF16/FP8 hybrid training pipeline without incurring costly precision round-trips (e.g., FP4 $\leftrightarrow$ BF16 $\leftrightarrow$ FP8). We address this challenge by introducing direct FP8-to-FP4 quantization and de-quantization, together with scaling-aware FP4 row-wise to column-wise conversion, enabling FP4 activations and expert-parallel communication with minimal overhead. Core MoE computations are executed in FP8, while activations and expert-parallel…
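To make the MXFP4 format named in the abstract concrete, the following is a minimal sketch of block-wise 4-bit quantization in plain Python. It is not the authors' direct FP8-to-FP4 kernel, whose details are not given on this page. It assumes the usual microscaling layout (32 elements sharing one power-of-two scale, each element stored as a sign bit plus a 3-bit E2M1 magnitude), takes ordinary Python floats as input rather than FP8 values, and uses hypothetical helper names (`quantize_block`, `dequantize_block`) chosen only for illustration.

```python
import math

# Magnitudes representable by a 4-bit E2M1 element (sign bit stored separately).
E2M1_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
BLOCK = 32  # assumed number of elements sharing one power-of-two scale


def quantize_block(values):
    """Quantize one block of floats; return (shared_exponent, 4-bit codes)."""
    amax = max(abs(v) for v in values)
    if amax == 0.0:
        return 0, [0] * len(values)
    # Choose a power-of-two scale so the block maximum lands in the top
    # E2M1 binade [4, 8); magnitudes above 6 saturate below.
    exp = math.floor(math.log2(amax)) - 2
    scale = 2.0 ** exp
    codes = []
    for v in values:
        mag = min(abs(v) / scale, E2M1_GRID[-1])  # saturate at 6.0
        # Nearest grid point; ties resolve to the smaller magnitude here.
        idx = min(range(len(E2M1_GRID)), key=lambda i: abs(E2M1_GRID[i] - mag))
        codes.append(idx | (8 if v < 0 else 0))  # bit 3 holds the sign
    return exp, codes


def dequantize_block(exp, codes):
    """Reconstruct floats from a shared exponent and 4-bit codes."""
    scale = 2.0 ** exp
    return [(-1.0 if c & 8 else 1.0) * E2M1_GRID[c & 7] * scale for c in codes]


if __name__ == "__main__":
    block = [math.sin(0.3 * i) for i in range(BLOCK)]
    exp, codes = quantize_block(block)
    restored = dequantize_block(exp, codes)
    worst = max(abs(a - b) for a, b in zip(block, restored))
    print("shared exponent:", exp, "max abs error:", worst)
```

Each code fits in 4 bits, so two codes pack into one byte. With one 8-bit shared exponent per 32 elements, storage under these assumptions comes to 4.25 bits per element, compared with 8 bits for FP8, which is where the memory and bandwidth savings of FP4 activations and communication would come from.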

Taxonomy

Topics: Advanced Neural Network Applications · Generative Adversarial Networks and Image Synthesis · Stochastic Gradient Optimization Techniques