FP8-Flow-MoE: A Casting-Free FP8 Recipe without Double Quantization Error
Fengjuan Wang, Zhiyi Su, Xingzhu Hu, Cheng Wang, Mou Sun

TL;DR
The paper introduces FP8-Flow-MoE, a novel FP8 training method for large MoE models that reduces memory and increases throughput by eliminating redundant casts and ensuring quantization consistency, while maintaining stability.
Contribution
It presents a new FP8-centric dataflow with scaling-aware operations that avoids double quantization errors and improves efficiency in large-scale MoE training.
Findings
Up to 21% higher throughput compared to BF16
16.5 GB lower memory usage per GPU
Stable convergence maintained in large models
Abstract
Training large Mixture-of-Experts (MoE) models remains computationally prohibitive due to their extreme compute and memory demands. Although low-precision training promises to accelerate computation and reduce memory footprint, existing implementations still rely on BF16-dominated dataflows with frequent quantize-dequantize (Q/DQ) conversions. These redundant casts erode much of FP8's theoretical efficiency. However, naively removing these casts by keeping dataflows entirely in FP8 introduces double quantization error: tensors quantized along different dimensions accumulate inconsistent scaling factors, degrading numerical stability. We propose FP8-Flow-MoE, an FP8 training recipe featuring a quantization-consistent FP8-centric dataflow with a scaling-aware transpose and fused FP8 operators that streamline computation and reduce the number of explicit cast operations from 12 to 2. Evaluations on…
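The double quantization error described in the abstract can be illustrated with a small NumPy simulation. This is a minimal sketch under stated assumptions, not the paper's implementation: it emulates FP8 E4M3 rounding in float64, uses simple amax-based per-row and per-column scales, and the helper names `round_to_e4m3` and `quantize_dequantize` are hypothetical. It compares quantizing a tensor once along columns against quantizing along rows first, dequantizing, and then requantizing along columns, as a naive all-FP8 dataflow might do when a transposed operand is needed.

```python
import numpy as np

E4M3_MAX = 448.0  # largest finite value in the FP8 E4M3 format


def round_to_e4m3(x):
    """Round values to the nearest E4M3-representable value (simulated in float64)."""
    x = np.clip(np.asarray(x, dtype=np.float64), -E4M3_MAX, E4M3_MAX)
    # frexp gives |x| = m * 2**e with m in [0.5, 1), so floor(log2|x|) = e - 1.
    _, e = np.frexp(np.abs(x))
    # Clamp to the minimum normal exponent (-6) so tiny values use subnormal spacing.
    exp = np.maximum(e - 1, -6)
    # 3 mantissa bits give a spacing of 2**(exp - 3) inside each binade.
    step = np.ldexp(1.0, exp - 3)
    return np.round(x / step) * step


def quantize_dequantize(x, axis):
    """Simulated FP8 round trip with one amax-based scale per slice along `axis`."""
    amax = np.max(np.abs(x), axis=axis, keepdims=True)
    scale = np.maximum(amax, 1e-12) / E4M3_MAX  # guard against all-zero slices
    return round_to_e4m3(x / scale) * scale


rng = np.random.default_rng(0)
x = rng.standard_normal((128, 128))

# Single quantization: one scale per column (reduce over axis 0).
x_single = quantize_dequantize(x, axis=0)

# Double quantization: one scale per row first, then requantize per column.
x_double = quantize_dequantize(quantize_dequantize(x, axis=1), axis=0)

err_single = np.mean(np.abs(x_single - x))
err_double = np.mean(np.abs(x_double - x))
print(f"mean abs error, single quantization: {err_single:.6f}")
print(f"mean abs error, double quantization: {err_double:.6f}")
```

Because the row-wise and column-wise scales generally differ, the second rounding lands on a different grid than the first, so the two rounding errors compound rather than cancel. Per the abstract, the quantization-consistent dataflow and scaling-aware transpose are meant to avoid this effect; the sketch shows only the problem, not that remedy.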
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Advanced Neural Network Applications · Stochastic Gradient Optimization Techniques
