TL;DR
MicroMix is a mixed-precision quantization method with a co-designed GEMM kernel, tailored to NVIDIA's Blackwell architecture. On large language models it reports near-FP16 benchmark accuracy and a 2.29-3.38x speedup over TensorRT-FP16.
Contribution
MicroMix co-designs a mixed-precision quantization algorithm and a GEMM kernel built on Microscaling (MX) data formats, trading off accuracy and efficiency per linear layer for large language models on Blackwell hardware.
Findings
Achieves near-FP16 accuracy on Llama and Qwen models.
Attains a 2.29-3.38x speedup over TensorRT-FP16 on Blackwell GPUs.
Shows no accuracy loss on several benchmarks.
Abstract
Quantization significantly accelerates inference in large language models (LLMs) by replacing original high-precision matrices with low-precision counterparts. Recent advances in weight-activation quantization have primarily focused on mapping both weights and activations to the INT4 format. Although the new FP4 Tensor Cores in NVIDIA's Blackwell architecture offer up to 4x speedup over FP16, existing INT4-based kernels fail to fully exploit this capability due to mismatched data formats. To bridge this gap, we propose MicroMix, a co-designed mixed-precision quantization algorithm and GEMM kernel based on Microscaling (MX) data formats. Tailored for the Blackwell architecture, the MicroMix kernel supports arbitrary combinations of MXFP4, MXFP6, and MXFP8 channels, and produces BFloat16 outputs. To achieve a favorable trade-off between accuracy and efficiency for each linear layer, we…
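For readers unfamiliar with Microscaling formats, the following is a minimal sketch of MXFP4-style block quantization, not the MicroMix algorithm or kernel. It assumes the layout in the OCP MX specification: 32 elements share one power-of-two scale, and each element is an FP4 E2M1 value. The function names are hypothetical, and rounding is simplified to nearest-grid-value without a defined tie rule.

```python
import math

# Magnitudes representable by an FP4 E2M1 element (sign is stored separately).
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
E2M1_EMAX = 2      # exponent of the top E2M1 binade (4.0 == 2**2)
BLOCK_SIZE = 32    # elements sharing one scale in MX formats


def quantize_block_mxfp4(block):
    """Fake-quantize one block; return (shared_exponent, dequantized values)."""
    max_abs = max(abs(v) for v in block)
    if max_abs == 0.0:
        return 0, [0.0] * len(block)
    # floor(log2(max_abs)) via frexp, which avoids log2 rounding issues.
    floor_log2 = math.frexp(max_abs)[1] - 1
    # Shared power-of-two scale, clamped to an 8-bit exponent range.
    shared_exp = max(-127, min(127, floor_log2 - E2M1_EMAX))
    scale = 2.0 ** shared_exp
    out = []
    for v in block:
        mag = min(abs(v) / scale, E2M1_GRID[-1])  # saturate at 6.0
        # Simplification: nearest grid value; ties go to the smaller magnitude.
        nearest = min(E2M1_GRID, key=lambda g: abs(g - mag))
        out.append(math.copysign(nearest * scale, v))
    return shared_exp, out


def quantize_mxfp4(values):
    """Split a flat list into blocks of 32 and quantize each independently."""
    return [
        quantize_block_mxfp4(values[i:i + BLOCK_SIZE])
        for i in range(0, len(values), BLOCK_SIZE)
    ]


if __name__ == "__main__":
    exp, deq = quantize_block_mxfp4([12.0, -1.0, 0.3])
    print(exp, deq)
```

Because the scale is chosen from the largest magnitude in a block, a single outlier pushes the small values in that block toward zero. This is one reason a scheme might keep some channels in a wider format such as MXFP6 or MXFP8, as the abstract describes.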