Integer or Floating Point? New Outlooks for Low-Bit Quantization on Large Language Models
Yijia Zhang, Lingran Zhao, Shijie Cao, Wenqiang Wang, Ting Cao, Fan Yang, Mao Yang, Shanghang Zhang, Ningyi Xu

TL;DR
This paper compares low-bit integer and floating-point quantization for large language models, proposing a layer-wise mixed format approach (MoFQ) that improves performance and efficiency without hardware overhead.
Contribution
It introduces MoFQ, a layer-wise mixed-format quantization method, and shows that it outperforms single-format methods in LLM deployment.
Findings
MoFQ achieves state-of-the-art results in weight-only and weight-activation quantization.
MoFQ surpasses GPTQ in 4-bit weight-only quantization while also being faster.
MoFQ performs close to full precision in 8-bit weight-activation quantization.
Abstract
Efficient deployment of large language models (LLMs) necessitates low-bit quantization to minimize model size and inference cost. While low-bit integer formats (e.g., INT8/INT4) have been the conventional choice, emerging low-bit floating-point formats (e.g., FP8/FP4) offer a compelling alternative and are gaining support from cutting-edge hardware, such as NVIDIA's H100 GPU. However, the superiority of low-bit INT versus FP formats for quantization on LLMs remains unclear. In this study, we conduct a comparative analysis of INT and FP quantization with the same bit-width, revealing that the optimal quantization format varies across different layers due to the complexity and diversity of tensor distribution. Consequently, we advocate the Mixture of Formats Quantization (MoFQ), which selects the optimal format on a layer-wise basis. This simple yet effective approach achieves…
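The layer-wise selection described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the per-tensor scaling, the choice of weight reconstruction MSE as the selection metric, and the specific value grids (symmetric INT4 levels from -7 to 7, and an E2M1-style FP4 grid) are all assumptions made for illustration.

```python
# Hypothetical value grids (assumptions, not taken from the paper):
# symmetric INT4 levels -7..7, and an E2M1-style FP4 grid.
INT4_LEVELS = [float(v) for v in range(-7, 8)]
_FP4_POS = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
FP4_LEVELS = sorted({-v for v in _FP4_POS} | set(_FP4_POS))


def quantize(weights, levels):
    """Simulated quantization: scale the tensor so max |w| maps to the
    largest level, round each value to the nearest level, rescale back."""
    max_abs = max(abs(w) for w in weights)
    if max_abs == 0.0:
        return list(weights)
    scale = max_abs / max(levels)
    return [min(levels, key=lambda q: abs(w / scale - q)) * scale
            for w in weights]


def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def select_formats(layers):
    """For each layer, pick the format with the lower reconstruction MSE
    (assumed metric; the paper may use a different criterion)."""
    choice = {}
    for name, w in layers.items():
        errors = {
            "INT4": mse(w, quantize(w, INT4_LEVELS)),
            "FP4": mse(w, quantize(w, FP4_LEVELS)),
        }
        choice[name] = min(errors, key=errors.get)
    return choice


if __name__ == "__main__":
    layers = {
        # evenly spaced values sit exactly on the integer grid
        "evenly_spaced": [v / 10 for v in range(-7, 8)],
        # values dense near zero with a few large ones sit on the FP4 grid
        "dense_near_zero": FP4_LEVELS,
    }
    print(select_formats(layers))
```

In this toy setup, a tensor with evenly spaced values is reconstructed better by the integer grid, while a tensor with many small values and a few large ones is reconstructed better by the floating-point grid, which is the kind of per-layer variation the abstract points to.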
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Speech Recognition and Synthesis
