MicroMix: Efficient Mixed-Precision Quantization with Microscaling Formats for Large Language Models

Wenyuan Liu; Haoqian Meng; Yilun Luo; Yafei Zhao; Peng Zhang; Xindian Ma

arXiv:2508.02343 · cs.LG · March 31, 2026

TL;DR

MicroMix pairs a mixed-precision quantization method with a GEMM kernel tailored for NVIDIA's Blackwell architecture, reaching near-FP16 accuracy while significantly accelerating inference for large language models.

Contribution

The paper proposes a co-designed quantization algorithm and GEMM kernel built on Microscaling (MX) data formats, balancing accuracy and efficiency for large language models on Blackwell hardware.

Findings

01. Achieves near-FP16 accuracy on Llama and Qwen models.

02. Attains 2.29-3.38x acceleration over TensorRT-FP16 on GPUs.

03. Maintains lossless accuracy on several benchmarks.

Abstract

Quantization significantly accelerates inference in large language models (LLMs) by replacing original high-precision matrices with low-precision counterparts. Recent advances in weight-activation quantization have primarily focused on mapping both weights and activations to the INT4 format. Although the new FP4 Tensor Cores in NVIDIA's Blackwell architecture offer up to 4x speedup over FP16, existing INT4-based kernels fail to fully exploit this capability due to mismatched data formats. To bridge this gap, we propose MicroMix, a co-designed mixed-precision quantization algorithm and GEMM kernel based on Microscaling (MX) data formats. Tailored for the Blackwell architecture, the MicroMix kernel supports arbitrary combinations of MXFP4, MXFP6, and MXFP8 channels, and produces BFloat16 outputs. To achieve a favorable trade-off between accuracy and efficiency for each linear layer, we…
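
As a rough illustration of the Microscaling idea the abstract builds on, the sketch below quantizes a block of values to an MXFP4-style representation: 32 elements share one power-of-two scale, and each element is rounded to an FP4 (E2M1) value. It reflects our reading of the general MX block-scaling scheme, not the MicroMix algorithm or its kernel. The function names are hypothetical, the limited range of the 8-bit shared exponent is not modeled, and tie-breaking is simplified.

```python
import math

# Magnitudes an FP4 E2M1 element can represent (the sign is a separate bit).
FP4_E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
BLOCK_SIZE = 32  # elements that share one scale in the MX formats
E2M1_EMAX = 2    # exponent of the largest E2M1 magnitude: 6 = 1.5 * 2**2


def quantize_mxfp4_block(block):
    """Quantize one block; return (shared_exponent, dequantized_values).

    Hypothetical helper for illustration only, not part of MicroMix.
    """
    max_abs = max(abs(v) for v in block)
    if max_abs == 0.0:
        return 0, [0.0] * len(block)
    # frexp gives max_abs = m * 2**e with 0.5 <= m < 1, so floor(log2) = e - 1.
    _, e = math.frexp(max_abs)
    shared_exp = (e - 1) - E2M1_EMAX
    scale = 2.0 ** shared_exp
    dequantized = []
    for v in block:
        # Scale into the element range and clamp to the largest magnitude.
        mag = min(abs(v) / scale, FP4_E2M1_GRID[-1])
        # Nearest grid point; ties go to the smaller value (a simplification).
        q = min(FP4_E2M1_GRID, key=lambda g: abs(g - mag))
        dequantized.append(math.copysign(q * scale, v))
    return shared_exp, dequantized


def quantize_mxfp4(values):
    """Split a flat list into blocks of BLOCK_SIZE and quantize each one."""
    return [
        quantize_mxfp4_block(values[i:i + BLOCK_SIZE])
        for i in range(0, len(values), BLOCK_SIZE)
    ]


exp, deq = quantize_mxfp4_block([6.0, -3.0, 0.5, 1.4])
```

Because every element in a block shares one scale, a single large value coarsens the grid for the other 31. That is one plausible reason to keep some channels in a wider element format such as MXFP6 or MXFP8, as the abstract describes.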

Code & Models

Repositories

lwy2020/MicroMix (GitHub)

Videos

MicroMix: Efficient Mixed-Precision Quantization with Microscaling Formats for Large Language Models (SlidesLive)