TORQ: Two-Level Orthogonal Rotation for MXFP4 Quantization
Zukang Xu, Xing Hu, and Dawei Yang

TL;DR
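TORQ is a training-free post-training quantization (PTQ) method for MXFP4 activation quantization in large language models. It targets two structural imbalances between activation distributions and the MXFP4 block floating-point format (inter-block variance and intra-block codebook utilization) and reports substantially improved accuracy.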
Contribution
The paper introduces TORQ, a two-level orthogonal rotation framework that reshapes the activation space to improve MXFP4 quantization without additional training.
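Why an orthogonal rotation can be applied without retraining: rotating the activations by an orthogonal matrix and counter-rotating the next layer's weights leaves the layer output unchanged before quantization. The sketch below is a minimal illustration of that general property only. The shapes are hypothetical, and the random QR-based rotation is a stand-in, not TORQ's two-level construction, which this summary does not spell out.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical shapes: 4 tokens, hidden size 8, output size 3.
X = rng.normal(size=(4, 8))   # activations
W = rng.normal(size=(8, 3))   # weights of the following linear layer

# Random orthogonal matrix from a QR decomposition (stand-in rotation,
# not the rotation used by TORQ).
Q, _ = np.linalg.qr(rng.normal(size=(8, 8)))

X_rot = X @ Q      # rotated activations (the tensor that would be quantized)
W_rot = Q.T @ W    # counter-rotated weights, which can be folded offline

# Without quantization, (X Q)(Q^T W) equals X W up to rounding error.
max_diff = np.max(np.abs(X_rot @ W_rot - X @ W))
print(max_diff)
```

Because the product is preserved, the rotation is free to be chosen for how well the rotated activations fit the quantization format.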
Findings
On Qwen3-32B, the reported WikiText perplexities are 7.61 and 8.43; since lower is better, 7.61 is presumably the BF16 reference and 8.43 the TORQ result.
Accuracy increases from 38.40% to 73.63% with TORQ, approaching BF16 performance.
TORQ outperforms existing methods in MXFP4 activation quantization for large language models.
Abstract
As Large Language Models (LLMs) advance toward practical deployment, the Microscaling FP4 (MXFP4) format has emerged as a cornerstone for next-generation low-bit inference, owing to its ability to balance high dynamic range with hardware efficiency. However, directly applying MXFP4 to LLM activation quantization inevitably leads to significant accuracy degradation. In this paper, we theoretically analyze the error structure of MXFP4 activation quantization, revealing that the root cause of this performance drop lies in two structural imbalances between activation distributions and the MXFP4 block floating-point format: (1) extreme inter-block variance imbalance and (2) intra-block codebook utilization imbalance. To address these challenges, we propose TORQ (Two-level Orthogonal Rotation for MXFP4 Quantization), a training-free Post-Training Quantization (PTQ) framework designed to…
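To make the two imbalances concrete, here is a minimal NumPy sketch of MXFP4 quantize-dequantize, assuming the usual MX layout: blocks of 32 elements, one shared power-of-two scale per block, and FP4 E2M1 elements with magnitudes {0, 0.5, 1, 1.5, 2, 3, 4, 6}. The function name `mxfp4_fake_quant`, the toy input, and the two diagnostics (per-block variance and per-block code usage) are illustrative choices, not taken from the paper, and tie-breaking in rounding is simplified.

```python
import numpy as np

# Magnitudes representable by an FP4 E2M1 element (sign stored separately).
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
BLOCK = 32  # elements sharing one power-of-two scale


def mxfp4_fake_quant(x):
    """Quantize-dequantize x (length divisible by 32); also return grid indices."""
    blocks = np.asarray(x, dtype=np.float64).reshape(-1, BLOCK)
    amax = np.max(np.abs(blocks), axis=1, keepdims=True)
    # frexp gives amax = m * 2**e with m in [0.5, 1), so floor(log2(amax)) = e - 1.
    _, e = np.frexp(amax)
    # Shared scale 2**(floor(log2(amax)) - 2); 2 is the largest E2M1 exponent.
    scale = np.ldexp(1.0, e - 3)
    scaled = np.minimum(np.abs(blocks) / scale, FP4_GRID[-1])  # saturate at 6
    # Nearest grid point; ties go to the smaller magnitude (a simplification).
    codes = np.argmin(np.abs(scaled[..., None] - FP4_GRID), axis=-1)
    deq = np.sign(blocks) * FP4_GRID[codes] * scale
    return deq.reshape(np.shape(x)), codes


# Toy activation vector: two blocks, the first containing one large outlier.
x = np.ones(64)
x[0] = 50.0
deq, codes = mxfp4_fake_quant(x)

block_var = x.reshape(-1, BLOCK).var(axis=1)           # inter-block variance
usage = [np.bincount(c, minlength=8) for c in codes]   # intra-block code usage
print(block_var, usage[0], usage[1])
```

In this toy input the outlier forces the first block's scale up to 8, so the outlier saturates to 48 and the other 31 elements round to zero: only two of the eight magnitude levels are used. The second block has no outlier and is reproduced exactly. The gap between the two block variances and the skewed code usage are small-scale analogues of the inter-block and intra-block imbalances the abstract describes.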
