MOSS: Efficient and Accurate FP8 LLM Training with Microscaling and Automatic Scaling

Yu Zhang; Hui-Ling Zhen; Mingxuan Yuan; Bei Yu

arXiv:2511.05811·cs.LG·December 8, 2025

TL;DR

MOSS is a novel FP8 training framework that combines microscaling and automatic scaling to enable efficient, stable, and high-throughput training of large language models, matching BF16 performance.

Contribution

MOSS introduces a two-level microscaling strategy and automatic weight scaling, reducing overhead and improving efficiency in FP8 LLM training.
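As a rough illustration of what a two-level microscaling scheme can look like, the sketch below pairs one floating-point global scale per tensor with a power-of-two local scale per block. It is not the paper's implementation: the combination of an FP32 global scale with E8M0-style local scales follows the description in the peer reviews further down, the block size of 32 is borrowed from the OCP MX format, and `fake_fp8_e4m3`, `quantize_two_level`, and `dequantize_two_level` are hypothetical names that emulate FP8 rounding in plain Python.

```python
import math

FP8_E4M3_MAX = 448.0  # largest finite magnitude in FP8 E4M3


def fake_fp8_e4m3(x: float) -> float:
    """Round x to a nearby E4M3 value (simplified emulation, no NaN handling)."""
    if x == 0.0:
        return 0.0
    sign = -1.0 if x < 0 else 1.0
    mag = min(abs(x), FP8_E4M3_MAX)
    exp = max(math.floor(math.log2(mag)), -6)  # -6 is the smallest normal exponent
    step = 2.0 ** (exp - 3)                    # 3 mantissa bits
    return sign * min(round(mag / step) * step, FP8_E4M3_MAX)


def quantize_two_level(values, block_size=32):
    """Assumed scheme: one global float scale plus a power-of-two scale per block."""
    # Level 1: the global scale maps the tensor-wide maximum onto the FP8 range.
    amax = max(abs(v) for v in values) or 1.0
    global_scale = amax / FP8_E4M3_MAX
    blocks = []
    for start in range(0, len(values), block_size):
        block = [v / global_scale for v in values[start:start + block_size]]
        block_amax = max(abs(v) for v in block)
        # Level 2: an exponent-only local scale (E8M0-style) lifts small blocks
        # back into the well-resolved part of the FP8 range.
        exp = 0 if block_amax == 0.0 else math.ceil(math.log2(block_amax / FP8_E4M3_MAX))
        local_scale = 2.0 ** exp
        blocks.append((exp, [fake_fp8_e4m3(v / local_scale) for v in block]))
    return global_scale, blocks


def dequantize_two_level(global_scale, blocks):
    """Undo both scaling levels."""
    out = []
    for exp, quantized in blocks:
        out.extend(q * (2.0 ** exp) * global_scale for q in quantized)
    return out
```

In this sketch a block whose values are far smaller than the tensor-wide maximum would be flushed to zero by the global scale alone; the local exponent is what keeps such blocks representable.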

Findings

01. Achieves up to 34% higher training throughput.
02. Maintains performance comparable to BF16 baseline.
03. Enables stable training of 7B parameter models.

Abstract

Training large language models with FP8 formats offers significant efficiency gains. However, the reduced numerical precision of FP8 poses challenges for stable and accurate training. Current frameworks preserve training performance using mixed-granularity quantization, i.e., applying per-group quantization for activations and per-tensor/block quantization for weights. While effective, per-group quantization requires scaling along the inner dimension of matrix multiplication, introducing additional dequantization overhead. Moreover, these frameworks often rely on just-in-time scaling to dynamically adjust scaling factors based on the current data distribution. However, this online quantization is inefficient for FP8 training, as it involves multiple memory reads and writes that negate the performance benefits of FP8. To overcome these limitations, we propose MOSS, a novel FP8 training…
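The dequantization overhead of per-group scaling can be made concrete with a toy dot product. The sketch below is only an illustration under simplifying assumptions: plain Python floats stand in for FP8 values, the activation carries one scale per group along the inner dimension K, the weight carries a single per-tensor scale, and the function names are hypothetical.

```python
def dot_per_group(q_act, act_scales, q_wgt, wgt_scale, group_size):
    """Activation scales vary along K, so every group's partial sum
    must be rescaled before it can be accumulated."""
    acc = 0.0
    for g, scale in enumerate(act_scales):
        lo, hi = g * group_size, (g + 1) * group_size
        partial = sum(a * w for a, w in zip(q_act[lo:hi], q_wgt[lo:hi]))
        acc += partial * scale          # dequantization inside the K loop
    return acc * wgt_scale              # per-tensor weight scale applied once


def dot_per_tensor(q_act, act_scale, q_wgt, wgt_scale):
    """With a single scale per operand, the whole K loop runs on raw
    quantized values and the scales are applied once at the end."""
    acc = sum(a * w for a, w in zip(q_act, q_wgt))
    return acc * act_scale * wgt_scale  # dequantization in the epilogue
```

The first function performs one extra multiply per group inside the accumulation loop, while the second defers all scaling to a single multiply after the loop. That difference is the cost the abstract attributes to scaling along the inner dimension.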

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 2 · Confidence 3

Strengths

1. Automatic weight scaling (Adam’s bounded update) avoids real-time max-reduction, outperforming TE’s delayed scaling, which is novel and efficient.
2. Proofs (SNR, bounded updates) validate designs; experiments with clear metrics compare MOSS to BF16/COAT.
3. Well-structured framework with visualizations and detailed experimental setups for reproducibility.
4. Custom kernels enable MXFP8 on non-native hardware.
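The first point can be illustrated with a small sketch that contrasts a per-step max-reduction with a scale derived from a bound on how far the weights can move. This is not the paper's algorithm: it assumes that each optimizer step changes any weight by at most `update_bound * lr`, where `update_bound` is a hypothetical constant that depends on the Adam hyperparameters and on weight decay, and the class and method names are invented for illustration.

```python
FP8_E4M3_MAX = 448.0  # largest finite magnitude in FP8 E4M3


def jit_scale(weights):
    """Just-in-time scaling: a full max-reduction over the tensor on every step."""
    return max(abs(w) for w in weights) / FP8_E4M3_MAX


class PredictedScale:
    """Hypothetical helper that tracks an upper bound on max|w|
    without rereading the weights on every step."""

    def __init__(self, weights):
        # One reduction at initialization.
        self.amax_bound = max(abs(w) for w in weights)

    def advance(self, lr, update_bound=1.0):
        # Assumption: |w_new - w_old| <= update_bound * lr for every weight,
        # so the maximum magnitude can grow by at most that amount per step.
        self.amax_bound += update_bound * lr
        return self.amax_bound / FP8_E4M3_MAX

    def resync(self, weights):
        # Occasional exact reduction to stop the bound from drifting upward.
        self.amax_bound = max(abs(w) for w in weights)
        return self.amax_bound / FP8_E4M3_MAX
```

Because the bound only grows between resyncs, the predicted scale never underestimates the true maximum under the stated assumption, so values do not overflow the FP8 range; the price is a slightly looser scale than an exact reduction would give.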

Weaknesses

1. Limited originality of two-level microscaling: The strategy overlaps heavily with the MXFP standard (OCP’s microscaling format), which already defines tensor subblock partitioning and E8M0 local scale factors to optimize FP8’s dynamic range. An FP32 global scale is also used in the NVFP format, which limits the originality of this module.
2. Experimental gaps: Figure 5 (OLMo-7B pretraining loss) obscures the BF16 baseline curve for steps > 2000, precluding direct verification of MOSS’s cla…

Reviewer 02 · Rating 6 · Confidence 3

Strengths

- Kernel-aware two-level microscaling keeps the GEMM inner loop on Tensor Cores and shifts dequantization to the epilogue; the mechanism is clearly illustrated.
- Solid empirical parity with BF16 at 7B alongside better throughput.
- The writing and figures are clear and the limitations section is candid about scope.

Weaknesses

- The paper focuses on throughput but does not report memory/communication gains.
- MOSS’s GEMM is slower than DeepGEMM on several shapes (Table 4).
- Longer runs in Appendix B report only MOSS.

Reviewer 03 · Rating 6 · Confidence 3

Strengths

1. Clear motivation and practical relevance for FP8 LLM training.
2. Elegant two-level microscaling design balancing accuracy and efficiency.
3. Simple yet effective automatic scaling removing runtime overhead.
4. Strong empirical results: BF16-level accuracy, 34–47% faster.
5. Works on standard GPUs without special hardware support.

Weaknesses

1. Evaluation is limited to mid-sized models (up to 7B parameters); scalability to larger settings (e.g., 30B–32B models) is not demonstrated.
2. The paper mainly reports throughput improvements, but does not deeply analyze memory, communication, or energy efficiency, which are also key for FP8 training.
3. While results are strong on core GEMM operations, extensions to other components (e.g., LayerNorm, activation functions, or optimizer states) remain unexplored.

Taxonomy

Topics: Advanced Neural Network Applications · Topic Modeling · Stochastic Gradient Optimization Techniques