Block Rotation is All You Need for MXFP4 Quantization
Yuantian Shao, Peisong Wang, Yuanteng Chen, Chang Xu, Zhihui Wei, Jian Cheng

TL;DR
This paper investigates the challenges of applying rotation-based post-training quantization methods to the MXFP4 format in large language models, identifies fundamental incompatibilities, and proposes a simple block rotation strategy to improve accuracy.
Contribution
It provides the first comprehensive benchmark of PTQ methods under MXFP4, analyzes the root cause of rotation incompatibility, and introduces a novel block rotation approach for better quantization performance.
Findings
Rotation-based methods are incompatible with MXFP4 because global rotation conflicts with the format's block-wise power-of-two (PoT) scaling.
GPTQ consistently performs well across different formats and models.
The proposed block rotation strategy significantly improves quantization accuracy for MXFP4.
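The difference between a global rotation and a block rotation can be illustrated with a small NumPy sketch. This is not the authors' implementation. The helper names (`mxfp4_fake_quant`, `rotate_global`, `rotate_block`, `block_amax`), the choice of a normalized Hadamard matrix as the rotation, and the toy input are assumptions made here for illustration. The quantizer is a simplified model of MXFP4: 32-element blocks, E2M1 magnitudes, and a shared power-of-two scale per block.

```python
import numpy as np

FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])  # E2M1 magnitudes
BLOCK = 32  # elements sharing one power-of-two scale


def mxfp4_fake_quant(x, block=BLOCK):
    """Quantize-dequantize a 1-D array whose length is a multiple of `block`."""
    blocks = np.asarray(x, dtype=np.float64).reshape(-1, block)
    amax = np.abs(blocks).max(axis=1, keepdims=True)
    safe_amax = np.where(amax > 0, amax, 1.0)  # avoid log2(0) for all-zero blocks
    # Shared scale 2^(floor(log2(amax)) - 2); 2 is the largest E2M1 exponent (6 = 1.5 * 2^2).
    scale = 2.0 ** (np.floor(np.log2(safe_amax)) - 2)
    mags = np.clip(np.abs(blocks) / scale, 0.0, FP4_GRID[-1])  # saturate above 6
    # Nearest grid point; ties go to the smaller magnitude (a simplification).
    idx = np.abs(mags[..., None] - FP4_GRID).argmin(axis=-1)
    return (np.sign(blocks) * FP4_GRID[idx] * scale).reshape(np.shape(x))


def hadamard(n):
    """Orthonormal Sylvester Hadamard matrix; n must be a power of two."""
    assert n > 0 and n & (n - 1) == 0, "n must be a power of two"
    h = np.array([[1.0]])
    while h.shape[0] < n:
        h = np.block([[h, h], [h, -h]])
    return h / np.sqrt(n)


def rotate_global(x):
    """One rotation across the whole vector (assumed stand-in for global rotation)."""
    return x @ hadamard(x.size)


def rotate_block(x, block=BLOCK):
    """An independent rotation inside each quantization block."""
    return (x.reshape(-1, block) @ hadamard(block)).reshape(-1)


def block_amax(x, block=BLOCK):
    """Largest magnitude in each block; this determines the shared scale."""
    return np.abs(x.reshape(-1, block)).max(axis=1)


# Toy input (hypothetical): small values everywhere, one outlier in block 0.
rng = np.random.default_rng(0)
x = 0.1 * rng.standard_normal(256)
x[5] = 100.0

print("block max, no rotation:    ", block_amax(x).round(2))
print("block max, global rotation:", block_amax(rotate_global(x)).round(2))
print("block max, block rotation: ", block_amax(rotate_block(x)).round(2))
```

In this toy input, the global rotation spreads the outlier over all eight blocks, so every block's maximum, and therefore its shared scale, grows. The block rotation keeps each block's energy unchanged, so the outlier stays confined to block 0. The sketch only shows this mechanism and makes no claim about the accuracy numbers reported in the paper.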
Abstract
Large language models (LLMs) have achieved remarkable success, but their rapidly growing scale imposes prohibitive costs in memory, computation, and energy. Post-training quantization (PTQ) is a promising solution for efficient deployment, yet achieving accurate W4A4 quantization remains an open challenge. While most existing methods are designed for INT4 formats, the emergence of MXFP4 -- a new FP4 format with support from multiple hardware vendors (NVIDIA, AMD, Intel) -- raises questions about the applicability of current techniques. In this work, we establish a comprehensive benchmark of PTQ methods under the MXFP4 format. Through systematic evaluation, we find that methods like GPTQ consistently deliver strong performance, whereas rotation-based approaches, which are used by almost all state-of-the-art methods, suffer from severe incompatibility with MXFP4. We further provide the first in-depth…
Peer Reviews
Decision · Submitted to ICLR 2026
This paper establishes a comprehensive benchmark comparing state-of-the-art PTQ methods (GPTQ, SmoothQuant, QuaRot, SpinQuant) across multiple LLMs under MXFP4, providing strong empirical evidence and highlighting performance gaps in existing methods. This paper conducts a detailed analysis of the destructive interaction between rotation-based methods and MXFP4’s power-of-two scaling. Building on the identified issues, this paper proposes a simple yet effective Block-wise Rotation Quantization
1. The paper primarily focuses on the theoretical aspects of BRQ and MXFP4 quantization but lacks a detailed evaluation on real-world hardware deployment, such as latency, memory overhead, and computational cost. This leaves a gap in understanding how BRQ performs in practical settings.
2. The experiments predominantly focus on INT4 PTQ algorithms applied to MXFP4. However, the paper does not explore other quantization formats or different model sizes (e.g., INT8, mixed-precision), limiting the
- The paper is logically organized, providing a systematic analysis of the incompatibility between MXFP4 and rotation-based quantization methods, making the motivation and contributions easy to follow.
- The authors evaluate across multiple mainstream LLMs (e.g., LLaMA-3 8B, Qwen2.5, Mistral 7B) and compare with various quantization baselines such as GPTQ, QuaRot+, and BINT4, using both perplexity and zero-shot benchmarks.
- The proposed Block-Wise Rotation Quantization (BRQ) specifically addr
- MXFP4 is a block-wise quantization method, and adopting a block-wise rotation transform seems to be an intuitive idea.
- While the proposed BRQ method is empirically effective, the paper lacks deeper theoretical justification or formal analysis explaining why block-wise rotation achieves better quantization stability.
- In integer quantization, group-wise quantization is often required as well. Why does combining it with a rotation transform not harm accuracy?
- The latest NVIDIA GPUs supp
1. The evaluation is comprehensive. It tests multiple models (LLaMA-2/3, Mistral, Qwen), includes both perplexity and downstream task accuracy, and compares against a wide range of strong baselines (GPTQ, OmniQuant, SpinQuant, etc.). The inclusion of a 70B model further strengthens the claims.
2. The explanation of how MXFP4's PoT scaling struggles with large values and how global rotation amplifies small values in regular blocks is clear and intuitive.
1. The central problem and its solution are a straightforward, expected outcome for anyone with deep expertise in quantization. Applying a block-level transformation to align with a block-level quantization scheme is a natural and almost trivial engineering adjustment, not a novel research contribution. The MXFP4 format, by design, uses local block scaling (PoT) to contain outliers. Applying a global operation that deliberately spreads out outlier energy directly counteracts the format's core de
Taxonomy
Topics: Advanced Neural Network Applications · Natural Language Processing Techniques · Speech Recognition and Synthesis
