LLM-FP4: 4-Bit Floating-Point Quantized Transformers
Shih-yang Liu, Zechun Liu, Xijie Huang, Pingcheng Dong, Kwang-Ting Cheng

TL;DR
This paper introduces LLM-FP4, a novel 4-bit floating-point quantization method for large language models that improves performance and flexibility over traditional integer-based quantization, enabling efficient deployment of quantized LLMs.
Contribution
The paper presents a new 4-bit floating-point quantization approach for weights and activations in LLMs, including a strong baseline and a novel per-channel activation quantization technique.
Findings
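Quantizes both weights and activations of LLaMA-13B to 4 bits in a post-training manner
Achieves an average score of 63.1 on common-sense zero-shot reasoning tasks, 5.8 points below the full-precision model
Outperforms the previous state of the art by 12.7 points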
Abstract
We propose LLM-FP4 for quantizing both weights and activations in large language models (LLMs) down to 4-bit floating-point values, in a post-training manner. Existing post-training quantization (PTQ) solutions are primarily integer-based and struggle with bit widths below 8 bits. Compared to integer quantization, floating-point (FP) quantization is more flexible and can better handle long-tail or bell-shaped distributions, and it has emerged as a default choice in many hardware platforms. One characteristic of FP quantization is that its performance largely depends on the choice of exponent bits and clipping range. In this regard, we construct a strong FP-PTQ baseline by searching for the optimal quantization parameters. Furthermore, we observe a high inter-channel variance and low intra-channel variance pattern in activation distributions, which adds activation quantization…
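The dependence on exponent bits and clipping range described in the abstract can be illustrated with a small simulation. The sketch below is not the authors' implementation: the function names fp_quantize and search_format are hypothetical, the exponent bias is treated as a real number chosen so that the largest representable value equals the clipping maximum, and plain mean squared error on the tensor stands in for whatever objective the paper's search actually uses. It tries every split of the 3 non-sign bits between exponent and mantissa, sweeps a grid of clipping maxima, and keeps the combination with the lowest error.

```python
import math
import random


def fp_quantize(x, exp_bits, man_bits, maxval):
    """Round x to the nearest value of a simulated sign/exponent/mantissa format."""
    if x == 0.0:
        return 0.0
    # Assumption: a real-valued bias chosen so the largest code equals maxval.
    bias = 2 ** exp_bits - 1 - math.log2(maxval / (2 - 2.0 ** -man_bits))
    x = max(-maxval, min(maxval, x))  # clip to the representable range
    # Exponent of the binade containing x; values below the smallest
    # normal number share the subnormal exponent (field value 1).
    e = max(math.floor(math.log2(abs(x)) + bias), 1)
    step = 2.0 ** (e - bias - man_bits)  # spacing of the grid in that binade
    return step * round(x / step)


def search_format(values, total_bits=4, num_clips=20):
    """Grid-search exponent bits and clipping maximum by tensor MSE.

    Assumes at least one value is non-zero.
    """
    peak = max(abs(v) for v in values)
    best = None
    for exp_bits in range(1, total_bits):
        man_bits = total_bits - 1 - exp_bits  # one bit is the sign
        for i in range(1, num_clips + 1):
            maxval = peak * i / num_clips
            mse = sum(
                (v - fp_quantize(v, exp_bits, man_bits, maxval)) ** 2
                for v in values
            ) / len(values)
            if best is None or mse < best[0]:
                best = (mse, exp_bits, man_bits, maxval)
    return best


random.seed(0)
weights = [random.gauss(0.0, 1.0) for _ in range(2000)]  # toy bell-shaped tensor
mse, exp_bits, man_bits, maxval = search_format(weights)
print(f"best: E{exp_bits}M{man_bits}, clip={maxval:.3f}, mse={mse:.5f}")
```

With 2 exponent bits, 1 mantissa bit, and a clipping maximum of 6, the simulated grid is 0, 0.5, 1, 1.5, 2, 3, 4, 6 (and the negatives): dense near zero and sparse in the tail, which is why a floating-point grid can suit bell-shaped tensors better than a uniform integer grid.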
Taxonomy
Topics: Topic Modeling · Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning
Methods: Multi-Head Attention · Attention Is All You Need · Adam · Attention Dropout · Position-Wise Feed-Forward Layer · Label Smoothing · Residual Connection · Byte Pair Encoding · WordPiece · Dropout
