INT v.s. FP: A Comprehensive Study of Fine-Grained Low-bit Quantization Formats

Mengzhao Chen; Meng Wu; Hui Jin; Zhihang Yuan; Jing Liu; Chaoyi Zhang; Yunshui Li; Jie Huang; Jin Ma; Zeyue Xue; Zhiheng Liu; Xingyan Bin; Ping Luo

arXiv:2510.25602 · cs.LG · October 30, 2025

TL;DR

This study systematically compares floating-point (FP) and integer (INT) quantization formats for AI hardware. It finds that fine-grained integer formats such as MXINT8 often outperform their FP counterparts in accuracy and hardware efficiency, challenging the industry's shift toward low-precision FP.

Contribution

It provides a comprehensive analysis of FP versus INT quantization trade-offs, introduces a symmetric clipping method for INT training, and offers guidance for hardware-software co-design in AI accelerators.

Findings

01 MXINT8 outperforms its FP counterpart in both accuracy and hardware efficiency at 8-bit fine-grained quantization.

02 FP formats have an advantage at 4-bit, but NVINT4 can surpass NVFP4 when outlier mitigation is applied.

03 Symmetric clipping enables nearly lossless INT8 training, making INT formats more practical to deploy.

Abstract

Modern AI hardware, such as Nvidia's Blackwell architecture, is increasingly embracing low-precision floating-point (FP) formats to handle the pervasive activation outliers in Large Language Models (LLMs). Despite this industry trend, a unified comparison of FP and integer (INT) quantization across varying granularities has been missing, leaving algorithm and hardware co-design without clear guidance. This paper fills that gap by systematically investigating the trade-offs between FP and INT formats. We reveal a critical performance crossover: while FP excels in coarse-grained quantization, the comparison at fine-grained (block-wise) levels is more nuanced. Our comprehensive comparison demonstrates that for popular 8-bit fine-grained formats (e.g., MX with block size 32), MXINT8 is superior to its FP counterpart in both algorithmic accuracy and hardware efficiency. However, for 4-bit…
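To make the idea of fine-grained (block-wise) integer quantization concrete, here is a minimal sketch in plain Python. It is not the paper's implementation. It assumes a block size of 32 (the value the abstract gives for MX formats), a shared power-of-two scale per block, and a symmetric code range of [-127, 127] as one plausible reading of "symmetric clipping". The names `quantize_block`, `dequantize_block`, and `quantize_tensor` are hypothetical and chosen only for illustration.

```python
import math

BLOCK_SIZE = 32  # block size the abstract names for MX formats
QMAX = 127       # assumption: symmetric INT8 range [-127, 127], dropping -128


def quantize_block(block):
    """Quantize one block of floats to integer codes plus a shared scale.

    Assumption: the scale is the smallest power of two that maps the
    block's largest magnitude into [-QMAX, QMAX].
    """
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return [0] * len(block), 1.0
    exponent = math.ceil(math.log2(amax / QMAX))
    scale = 2.0 ** exponent
    # Round to the nearest code, then clip symmetrically around zero.
    codes = [max(-QMAX, min(QMAX, round(x / scale))) for x in block]
    return codes, scale


def dequantize_block(codes, scale):
    """Map integer codes back to floats using the block's shared scale."""
    return [c * scale for c in codes]


def quantize_tensor(values, block_size=BLOCK_SIZE):
    """Split a flat list into blocks and quantize each block independently."""
    return [
        quantize_block(values[i:i + block_size])
        for i in range(0, len(values), block_size)
    ]


if __name__ == "__main__":
    # Two blocks of small values; the first block also holds one outlier.
    values = [0.01 * i for i in range(-32, 32)]
    values[5] = 50.0
    for codes, scale in quantize_tensor(values):
        print("scale:", scale, "code range:", min(codes), max(codes))
```

Because each block carries its own scale, the outlier in the first block coarsens the quantization step only for the 32 values that share its block, while the second block keeps a much finer step. This locality is what distinguishes fine-grained from coarse-grained (per-tensor or per-channel) quantization in the comparison the abstract describes.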
