FP8-Flow-MoE: A Casting-Free FP8 Recipe without Double Quantization Error

Fengjuan Wang; Zhiyi Su; Xingzhu Hu; Cheng Wang; Mou Sun

arXiv:2511.02302·cs.LG·November 5, 2025

Open Access

TL;DR

The paper introduces FP8-Flow-MoE, an FP8 training recipe for large MoE models. It lowers memory use and raises throughput by removing redundant casts and keeping quantization consistent across the dataflow, while preserving stable convergence.

Contribution

It presents a new FP8-centric dataflow with scaling-aware operations that avoids double quantization errors and improves efficiency in large-scale MoE training.

Findings

01 Up to 21% higher throughput compared to BF16

02 16.5 GB lower memory usage per GPU

03 Stable convergence maintained in large models

Abstract

Training large Mixture-of-Experts (MoE) models remains computationally prohibitive due to their extreme compute and memory demands. Although low-precision training promises to accelerate computation and reduce memory footprint, existing implementations still rely on BF16-dominated dataflows with frequent quantize-dequantize (Q/DQ) conversions. These redundant casts erode much of FP8's theoretical efficiency. However, naively removing these casts by keeping dataflows entirely in FP8 introduces double quantization error: tensors quantized along different dimensions accumulate inconsistent scaling factors, degrading numerical stability. We propose FP8-Flow-MoE, an FP8 training recipe featuring a quantization-consistent FP8-centric dataflow with a scaling-aware transpose and fused FP8 operators that streamline computation and reduce the number of explicit cast operations from 12 to 2. Evaluations on…
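The double quantization error described in the abstract can be illustrated with a small simulation. The sketch below is not the paper's implementation: it assumes per-row and per-column amax scaling, and it approximates FP8 E4M3 with a hypothetical `fake_fp8` helper that rounds to a 4-bit significand and clamps to ±448 while ignoring subnormals and the exponent range. All function names are illustrative. It compares quantizing a matrix column-wise directly from high precision against quantizing it row-wise first and then column-wise.

```python
import math
import random

FP8_E4M3_MAX = 448.0  # largest finite E4M3 magnitude


def fake_fp8(x):
    """Simplified E4M3 rounding: 4-bit significand, clamp to +/-448.

    Assumption for illustration only: subnormals and the limited
    exponent range of real FP8 are ignored.
    """
    if x == 0.0:
        return 0.0
    m, e = math.frexp(x)  # x = m * 2**e with 0.5 <= |m| < 1
    y = math.ldexp(round(m * 16) / 16, e)
    return max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, y))


def qdq_rows(mat):
    """Quantize then dequantize each row with its own amax-based scale."""
    out = []
    for row in mat:
        amax = max(abs(v) for v in row)
        scale = amax / FP8_E4M3_MAX if amax > 0 else 1.0
        out.append([fake_fp8(v / scale) * scale for v in row])
    return out


def transpose(mat):
    return [list(col) for col in zip(*mat)]


def qdq_cols(mat):
    """Same as qdq_rows, but with one scale per column."""
    return transpose(qdq_rows(transpose(mat)))


def mean_abs_err(a, b):
    n = len(a) * len(a[0])
    return sum(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb)) / n


random.seed(0)
x = [[random.gauss(0.0, 1.0) for _ in range(32)] for _ in range(32)]

# Column-wise quantization straight from the high-precision tensor.
err_single = mean_abs_err(x, qdq_cols(x))
# Row-wise quantization first, then re-quantization along columns.
err_double = mean_abs_err(x, qdq_cols(qdq_rows(x)))

print(f"column-wise only:          {err_single:.5f}")
print(f"row-wise then column-wise: {err_double:.5f}")
```

In this toy setting the second rounding lands on a grid defined by different scaling factors than the first, so the two rounding errors add up instead of coinciding. That is the inconsistency a quantization-consistent dataflow is meant to avoid.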

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Advanced Neural Network Applications · Stochastic Gradient Optimization Techniques