MOSS: Efficient and Accurate FP8 LLM Training with Microscaling and Automatic Scaling
Yu Zhang, Hui-Ling Zhen, Mingxuan Yuan, Bei Yu

TL;DR
MOSS is a novel FP8 training framework that combines microscaling and automatic scaling to enable efficient, stable, and high-throughput training of large language models, matching BF16 performance.
Contribution
MOSS introduces a two-level microscaling strategy and automatic weight scaling, reducing overhead and improving efficiency in FP8 LLM training.
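The page does not spell out the automatic weight scaling rule. The sketch below is one plausible reading, based only on the reviewers' remark that it relies on Adam's bounded update to avoid a per-step max-reduction: if no weight can move by more than some known amount per step, an upper bound on max|w| can be advanced with a single addition instead of a pass over the tensor. The names AutoScale, step_bound, resync_every and fake_optimizer_step are hypothetical, the periodic re-sync is an added assumption, and random bounded perturbations stand in for real Adam steps.

```python
import random

FP8_E4M3_MAX = 448.0  # largest finite magnitude in the E4M3 format


class AutoScale:
    """Tracks an upper bound on max|w| instead of re-measuring it every step."""

    def __init__(self, weights, step_bound, resync_every):
        self.step_bound = step_bound      # assumed cap on any single weight's per-step change
        self.resync_every = resync_every  # hypothetical interval between exact reductions
        self.steps = 0
        self.amax_bound = max(abs(w) for w in weights)

    def after_optimizer_step(self, weights):
        self.steps += 1
        if self.steps % self.resync_every == 0:
            # Occasional exact max-reduction keeps the bound from drifting too far.
            self.amax_bound = max(abs(w) for w in weights)
        else:
            # O(1) update: no read of the weight tensor is needed.
            self.amax_bound += self.step_bound

    @property
    def scale(self):
        """Scale such that w / scale fits in the E4M3 range."""
        return self.amax_bound / FP8_E4M3_MAX if self.amax_bound > 0 else 1.0


def fake_optimizer_step(weights, step_bound, rng):
    """Stand-in for an Adam step: every weight moves by at most step_bound."""
    return [w + rng.uniform(-step_bound, step_bound) for w in weights]
```

The trade-off in this sketch is that the predicted bound is looser than the true maximum between re-syncs, so a little FP8 dynamic range is given up in exchange for skipping the reduction.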
Findings
Achieves up to 34% higher training throughput.
Maintains performance comparable to BF16 baseline.
Enables stable training of 7B parameter models.
Abstract
Training large language models with FP8 formats offers significant efficiency gains. However, the reduced numerical precision of FP8 poses challenges for stable and accurate training. Current frameworks preserve training performance using mixed-granularity quantization, i.e., applying per-group quantization for activations and per-tensor/block quantization for weights. While effective, per-group quantization requires scaling along the inner dimension of matrix multiplication, introducing additional dequantization overhead. Moreover, these frameworks often rely on just-in-time scaling to dynamically adjust scaling factors based on the current data distribution. However, this online quantization is inefficient for FP8 training, as it involves multiple memory reads and writes that negate the performance benefits of FP8. To overcome these limitations, we propose MOSS, a novel FP8 training…
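To make the two overheads named in the abstract concrete, here is a minimal pure-Python sketch of one GEMM output element with per-group activation scales and a per-tensor weight scale. It is an illustration under stated assumptions, not the paper's kernel: fake_fp8 is a crude stand-in for E4M3 rounding (it ignores subnormals and exponent underflow), and the function names and group size are hypothetical.

```python
import math

FP8_E4M3_MAX = 448.0  # largest finite magnitude in the E4M3 format


def fake_fp8(x):
    """Crude E4M3 stand-in: clip to +-448 and keep 4 significant bits."""
    x = max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, x))
    m, e = math.frexp(x)  # x == m * 2**e with 0.5 <= |m| < 1 (or m == 0)
    return math.ldexp(round(m * 16) / 16, e)


def quantize(values):
    """Just-in-time scaling: one pass to find amax, a second pass to quantize."""
    amax = max(abs(v) for v in values)
    scale = amax / FP8_E4M3_MAX if amax > 0 else 1.0
    return [fake_fp8(v / scale) for v in values], scale


def dot_per_group(act, wgt, group):
    """One output element: activations scaled per `group` inner-dimension
    elements, weights scaled per tensor."""
    q_w, s_w = quantize(wgt)
    total = 0.0
    for start in range(0, len(act), group):
        q_a, s_a = quantize(act[start:start + group])
        partial = sum(a * w for a, w in zip(q_a, q_w[start:start + group]))
        total += partial * s_a  # dequantization happens inside the inner-dimension loop
    return total * s_w          # the per-tensor scale is applied once at the end
```

The multiply by s_a inside the loop is the extra dequantization work that per-group scaling along the inner dimension introduces, and the two passes in quantize correspond to the repeated memory traffic of just-in-time scaling.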
Peer Reviews
Decision: ICLR 2026 Poster
1. Automatic weight scaling (based on Adam’s bounded update) avoids real-time max-reduction and outperforms TE’s delayed scaling; it is novel and efficient.
2. Proofs (SNR, bounded updates) validate the design; experiments with clear metrics compare MOSS to BF16 and COAT.
3. Well-structured framework with visualizations and detailed experimental setups for reproducibility.
4. Custom kernels enable MXFP8 on non-native hardware.
1. Limited originality of two-level microscaling: The strategy overlaps heavily with the MXFP standard (OCP’s microscaling format), which already defines tensor subblock partitioning and E8M0 local scale factors to optimize FP8’s dynamic range. An FP32 global scale is also used in the NVFP format, which further limits the originality of this module.
2. Experimental gaps: Figure 5 (OLMo-7B pretraining loss) obscures the BF16 baseline curve for steps > 2000, precluding direct verification of MOSS’s cla
- Kernel-aware two-level microscaling keeps the GEMM inner loop on Tensor Cores and shifts dequantization to the epilogue; the mechanism is clearly illustrated.
- Solid empirical parity with BF16 at 7B alongside better throughput.
- The writing and figures are clear, and the limitations section is candid about scope.
- The paper focuses on throughput but does not report memory or communication gains.
- MOSS’s GEMM is slower than DeepGEMM on several shapes (Table 4).
- The longer runs in Appendix B report only MOSS.
1. Clear motivation and practical relevance for FP8 LLM training.
2. Elegant two-level microscaling design balancing accuracy and efficiency.
3. Simple yet effective automatic scaling that removes runtime overhead.
4. Strong empirical results: BF16-level accuracy, 34–47% faster.
5. Works on standard GPUs without special hardware support.
1. Evaluation is limited to mid-sized models (up to 7B parameters); scalability to larger settings (e.g., 30B–32B models) is not demonstrated.
2. The paper mainly reports throughput improvements and does not deeply analyze memory, communication, or energy efficiency, which are also key for FP8 training.
3. While results are strong on core GEMM operations, extensions to other components (e.g., LayerNorm, activation functions, or optimizer states) remain unexplored.
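The reviews above describe the two-level scheme as an FP32 global scale combined with E8M0 (power-of-two) local scales per subblock, with dequantization moved to the GEMM epilogue. The sketch below illustrates that description for a single dot product in pure Python. It is an assumption-laden illustration rather than the paper's kernel: the block size of 32 is taken from the OCP MX format and may differ from what MOSS uses, fake_fp8 is a crude E4M3 stand-in that ignores subnormals and underflow, and all function names are hypothetical.

```python
import math

FP8_E4M3_MAX = 448.0  # largest finite magnitude in the E4M3 format
BLOCK = 32            # OCP MX block size; assumed here, not confirmed for MOSS


def fake_fp8(x):
    """Crude E4M3 stand-in: clip to +-448 and keep 4 significant bits."""
    x = max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, x))
    m, e = math.frexp(x)
    return math.ldexp(round(m * 16) / 16, e)


def quantize_two_level(values, block=BLOCK):
    """Return (fp8 codes, per-block power-of-two exponents, global scale)."""
    amax = max(abs(v) for v in values)
    g = amax / FP8_E4M3_MAX if amax > 0 else 1.0  # level 1: one high-precision scale
    codes, exps = [], []
    for start in range(0, len(values), block):
        chunk = values[start:start + block]
        local = max(abs(v) for v in chunk) / g
        # Level 2: smallest power of two 2**e that keeps the block within +-448.
        e = math.ceil(math.log2(local / FP8_E4M3_MAX)) if local > 0 else 0
        e = max(-127, min(0, e))  # E8M0 stores only an exponent
        exps.append(e)
        codes.extend(fake_fp8(math.ldexp(v / g, -e)) for v in chunk)
    return codes, exps, g


def dot_two_level(a, b, block=BLOCK):
    """Dot product of two vectors quantized with the two-level scheme."""
    qa, ea, ga = quantize_two_level(a, block)
    qb, eb, gb = quantize_two_level(b, block)
    acc = 0.0
    for i in range(len(ea)):
        lo = i * block
        partial = sum(x * y for x, y in zip(qa[lo:lo + block], qb[lo:lo + block]))
        acc += math.ldexp(partial, ea[i] + eb[i])  # local scales: exponent shift only
    return acc * ga * gb  # global scales applied once, as an epilogue step
```

In this sketch the only floating-point multiplies by scale factors happen once at the end; the per-block correction inside the loop is an exponent adjustment, which is what makes power-of-two local scales cheap.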
Taxonomy
Topics: Advanced Neural Network Applications · Topic Modeling · Stochastic Gradient Optimization Techniques
