Unveiling the Potential of Quantization with MXFP4: Strategies for Quantization Error Reduction

Jatin Chhugani; Geonhwa Jeong; Bor-Yiing Su; Yunjie Pan; Hanmei Yang; Aayush Ankit; Jiecao Yu; Summer Deng; Yunqing Chen; Nadathur Satish; Changkyu Kim

arXiv:2603.08713 · cs.AR · March 11, 2026

TL;DR

This paper presents two software techniques, OAS and MBS, that significantly improve MXFP4 quantization accuracy for large language models, making it a practical, hardware-efficient alternative to NVIDIA's NVFP4 with minimal performance overhead.

Contribution

Introduction of Overflow-Aware Scaling and Macro Block Scaling methods that enhance MXFP4 quantization fidelity without hardware modifications.

Findings

1. Reduced the end-to-end accuracy gap between MXFP4 and NVFP4 from about 10% to below 1% on average.

2. Achieved near-NVFP4 accuracy with only 6.2% GEMM overhead.

3. Enabled MXFP4 to match NVFP4 performance while maintaining hardware efficiency.

Abstract

Large Language Models (LLMs) have intensified the need for low-precision formats that enable efficient, large-scale inference. The Open Compute Project (OCP) Microscaling (MX) standard is attractive due to its favorable hardware efficiency, but its 4-bit variant (MXFP4) lags behind NVIDIA's NVFP4 in accuracy, limiting adoption. We introduce two software-only techniques, Overflow-Aware Scaling (OAS) and Macro Block Scaling (MBS), that improve MXFP4 quantization fidelity without requiring hardware changes. OAS reduces overall errors by increasing effective dynamic range under power-of-two block scaling, while MBS allocates higher-precision scaling at a coarser granularity to better preserve outliers. Across multiple LLMs and standard downstream benchmarks, OAS and MBS reduce the end-to-end accuracy gap between MXFP4 and NVFP4 from about 10% to below 1% on average, while incurring modest…
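
For readers unfamiliar with the "power-of-two block scaling" the abstract mentions, the following is a minimal sketch of baseline MXFP4 quantization only. It does not implement OAS or MBS, whose details are not given on this page. It assumes the OCP MX convention of 32-element blocks, FP4 E2M1 elements, and a shared scale of 2^(floor(log2(amax)) - 2). The function names `round_to_fp4` and `quantize_block` are hypothetical, and tie-breaking is simplified.

```python
import math

# Magnitudes representable by an FP4 E2M1 element (sign is handled separately).
FP4_E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
FP4_EMAX = 2      # exponent of the largest E2M1 power of two (4 = 2**2)
BLOCK_SIZE = 32   # number of elements sharing one scale in the MX formats


def round_to_fp4(x):
    """Round x to the nearest E2M1 value, saturating at +/-6.

    Simplification: ties go to the smaller magnitude, whereas real
    implementations typically round ties to even.
    """
    mag = min(abs(x), FP4_E2M1_GRID[-1])
    nearest = min(FP4_E2M1_GRID, key=lambda g: abs(g - mag))
    return math.copysign(nearest, x)


def quantize_block(block):
    """Quantize one block with a shared power-of-two scale.

    Returns (scale, dequantized_values).
    """
    amax = max(abs(v) for v in block)
    if amax == 0.0:
        return 1.0, [0.0] * len(block)
    # frexp gives amax = m * 2**e with 0.5 <= m < 1, so floor(log2(amax)) = e - 1.
    _, e = math.frexp(amax)
    shared_exp = (e - 1) - FP4_EMAX
    scale = 2.0 ** shared_exp
    return scale, [round_to_fp4(v / scale) * scale for v in block]


block = [7.0, 1.0, 0.3] + [0.0] * (BLOCK_SIZE - 3)
scale, dequantized = quantize_block(block)
print(scale, dequantized[:3])
```

In this example the block maximum 7.0 yields a scale of 1.0, so 7.0 exceeds the largest FP4 value and saturates to 6.0. Because the scale can only move in factors of two, the block maximum can land above the representable range, which is the kind of error that a constraint to power-of-two scales introduces.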

Taxonomy

Topics: Parallel Computing and Optimization Techniques · Advanced Neural Network Applications · Embedded Systems Design Techniques