ZeroQuant(4+2): Redefining LLMs Quantization with a New FP6-Centric Strategy for Diverse Generative Tasks
Xiaoxia Wu, Haojun Xia, Stephen Youn, Zhen Zheng, Shiyang Chen, Arash Bakhtiari, Michael Wyatt, Reza Yazdani Aminabadi, Yuxiong He, Olatunji Ruwase, Leon Song, Zhewei Yao

TL;DR
This paper introduces a novel FP6-centric quantization strategy for large language models, demonstrating superior accuracy and hardware efficiency over traditional INT4 methods across diverse generative tasks.
Contribution
The study proposes a 4+2 FP6 quantization scheme that enhances LLM performance and hardware compatibility, addressing limitations of existing 4-bit quantization methods.
Findings
FP6 outperforms INT4 in accuracy and versatility across tasks.
FP6 enables code generation performance comparable to FP16.
The 4+2 design achieves latency similar to INT4, improving hardware efficiency.
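The "4+2" idea in the findings above can be illustrated with a small bit-manipulation sketch: each 6-bit weight code is split into a 4-bit part and a 2-bit part, and each part is packed into its own byte-aligned stream. This is only an illustration under stated assumptions. The helper names, the choice of which bits go into which segment, and the packing order are hypothetical, and the paper's actual kernel layout may differ.

```python
def split_4_plus_2(codes):
    """Split 6-bit codes into a 4-bit high part and a 2-bit low part.

    Which bits go where is an assumption made for illustration.
    """
    high, low = [], []
    for c in codes:
        if not 0 <= c < 64:
            raise ValueError("expected a 6-bit code")
        high.append(c >> 2)    # top 4 bits
        low.append(c & 0b11)   # bottom 2 bits
    return high, low


def pack(values, bits):
    """Pack fixed-width values into bytes, first value in the lowest bits."""
    per_byte = 8 // bits
    out = bytearray()
    for i in range(0, len(values), per_byte):
        byte = 0
        for j, v in enumerate(values[i:i + per_byte]):
            byte |= v << (j * bits)
        out.append(byte)
    return bytes(out)


def unpack(data, bits, count):
    """Inverse of pack: recover `count` fixed-width values from bytes."""
    per_byte = 8 // bits
    mask = (1 << bits) - 1
    values = [(b >> (j * bits)) & mask for b in data for j in range(per_byte)]
    return values[:count]


def merge_4_plus_2(high, low):
    """Rebuild the 6-bit codes from their 4-bit and 2-bit parts."""
    return [(h << 2) | l for h, l in zip(high, low)]


codes = [0, 63, 37, 18]
high, low = split_4_plus_2(codes)
high_bytes = pack(high, 4)   # two 4-bit parts per byte
low_bytes = pack(low, 2)     # four 2-bit parts per byte
restored = merge_4_plus_2(unpack(high_bytes, 4, len(codes)),
                          unpack(low_bytes, 2, len(codes)))
```

Four 6-bit codes occupy 24 bits, and the two streams together take exactly 3 bytes, so the split adds no storage overhead while keeping every stored field aligned to a power-of-two width.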
Abstract
This study examines 4-bit quantization methods like GPTQ in large language models (LLMs), highlighting GPTQ's overfitting and limited enhancement in zero-shot tasks. While prior work has focused mainly on zero-shot evaluation, we extend the task scope to more generative categories such as code generation and abstractive summarization, where we found that INT4 quantization can significantly underperform. However, simply shifting to higher-precision formats like FP6 has been particularly challenging, and has thus been overlooked, due to poor performance caused by the lack of sophisticated integration and system acceleration strategies on current AI hardware. Our results show that FP6, even with a coarse-grain quantization scheme, performs robustly across various algorithms and tasks, demonstrating its superiority in accuracy and versatility. Notably, with the FP6 quantization, StarCoder-15B model…
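To make "FP6 with a coarse-grain quantization scheme" concrete, here is a minimal sketch of round-to-nearest quantization onto a 6-bit floating-point grid with a single scale per weight row. Several details are assumptions not stated in the text above: the 1-3-2 (sign, exponent, mantissa) bit layout, the exponent bias of 3, the absence of inf/NaN codes, and the use of one scale per row as a stand-in for "coarse-grain". The function names are hypothetical.

```python
def fp6_e3m2_grid(bias=3):
    """Enumerate all values of an assumed 1-3-2 float format (no inf/NaN)."""
    positives = set()
    for e in range(8):          # 3 exponent bits
        for m in range(4):      # 2 mantissa bits
            if e == 0:
                # subnormal numbers: no implicit leading 1
                positives.add((m / 4) * 2.0 ** (1 - bias))
            else:
                positives.add((1 + m / 4) * 2.0 ** (e - bias))
    return sorted({-v for v in positives} | positives)


def quantize_row(row, grid):
    """Scale a row so its largest magnitude maps to the grid maximum,
    round every entry to the nearest grid value, then rescale."""
    max_abs = max(abs(x) for x in row)
    scale = max_abs / grid[-1] if max_abs > 0 else 1.0
    return [min(grid, key=lambda g: abs(x / scale - g)) * scale for x in row]


grid = fp6_e3m2_grid()
quantized = quantize_row([28.0, 1.0, -0.3], grid)
```

Under these assumptions the grid has 63 distinct values with a maximum magnitude of 28, and the spacing between neighbouring values grows with magnitude. This non-uniform spacing is the main difference from an INT4 grid, which has 16 evenly spaced levels.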
Taxonomy
Topics: Topic Modeling · Ferroelectric and Negative Capacitance Devices · Advanced Neural Network Applications
