LLM-FP4: 4-Bit Floating-Point Quantized Transformers

Shih-yang Liu; Zechun Liu; Xijie Huang; Pingcheng Dong; Kwang-Ting Cheng

arXiv:2310.16836 · cs.CL · April 30, 2024 · 2 cites

TL;DR

This paper introduces LLM-FP4, a novel 4-bit floating-point quantization method for large language models that improves performance and flexibility over traditional integer-based quantization, enabling efficient deployment of quantized LLMs.

Contribution

The paper presents a new 4-bit floating-point quantization approach for weights and activations in LLMs, including a strong baseline and a novel per-channel activation quantization technique.

Findings

1. Quantizes both weights and activations of LLaMA-13B to 4 bits, scoring only 5.8 points below the full-precision model
2. Achieves an average score of 63.1 on common-sense zero-shot reasoning tasks
3. Outperforms the previous state of the art by 12.7 points

Abstract

We propose LLM-FP4 for quantizing both weights and activations in large language models (LLMs) down to 4-bit floating-point values, in a post-training manner. Existing post-training quantization (PTQ) solutions are primarily integer-based and struggle with bit widths below 8 bits. Compared to integer quantization, floating-point (FP) quantization is more flexible and can better handle long-tail or bell-shaped distributions, and it has emerged as a default choice in many hardware platforms. One characteristic of FP quantization is that its performance largely depends on the choice of exponent bits and clipping range. In this regard, we construct a strong FP-PTQ baseline by searching for the optimal quantization parameters. Furthermore, we observe a high inter-channel variance and low intra-channel variance pattern in activation distributions, which adds activation quantization…
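The abstract's point that FP quantization quality depends on the exponent-bit count and the clipping range can be illustrated with a small simulation. The sketch below is not the authors' implementation: the names `fp_grid`, `fp_quantize`, and `search_format` are hypothetical, the format is assumed to have one sign bit, subnormals, and no inf/NaN encodings, and the search minimizes plain tensor MSE on synthetic data as a stand-in for whatever objective the paper actually uses.

```python
import numpy as np


def fp_grid(exp_bits, man_bits):
    """Non-negative magnitudes of a minifloat, normalized so the largest is 1.

    Assumption: subnormals at exponent code 0, no inf/NaN encodings.
    """
    values = []
    for e in range(2 ** exp_bits):
        for m in range(2 ** man_bits):
            frac = m / 2 ** man_bits
            if e == 0:
                values.append(2.0 * frac)  # subnormal: same exponent as e == 1
            else:
                values.append((1.0 + frac) * 2.0 ** e)
    grid = np.unique(values)  # sorted, duplicates removed
    return grid / grid.max()


def fp_quantize(x, exp_bits, clip_max, total_bits=4):
    """Round x to the nearest value of a signed minifloat grid scaled to clip_max."""
    man_bits = total_bits - 1 - exp_bits  # one bit is reserved for the sign
    grid = fp_grid(exp_bits, man_bits) * clip_max
    mag = np.minimum(np.abs(x), clip_max)  # clip outliers to the largest value
    idx = np.abs(mag[..., None] - grid).argmin(axis=-1)
    return np.sign(x) * grid[idx]


def search_format(x, total_bits=4, clip_fractions=np.linspace(0.3, 1.0, 15)):
    """Grid-search exponent bits and clipping range by tensor MSE (a simplification)."""
    absmax = np.abs(x).max()
    best = None
    for exp_bits in range(1, total_bits):
        for frac in clip_fractions:
            clip_max = frac * absmax
            err = np.mean((x - fp_quantize(x, exp_bits, clip_max, total_bits)) ** 2)
            if best is None or err < best[0]:
                best = (err, exp_bits, clip_max)
    return best


rng = np.random.default_rng(0)
weights = rng.standard_normal(4096)  # synthetic bell-shaped tensor, not real LLM weights
err, exp_bits, clip_max = search_format(weights)
print(f"E{exp_bits}M{3 - exp_bits}, clip_max={clip_max:.3f}, mse={err:.5f}")
```

With one exponent bit the grid is uniform, which makes it behave like an integer quantizer; more exponent bits concentrate levels near zero at the cost of coarser steps near the clipping value, which is why the choice interacts with the shape of the distribution.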

Code & Models

Repositories

nbasyl/llm-fp4 (PyTorch, official)

Taxonomy

Topics: Topic Modeling · Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning

Methods: Multi-Head Attention · Attention Is All You Need · Adam · Attention Dropout · Position-Wise Feed-Forward Layer · Label Smoothing · Residual Connection · Byte Pair Encoding · WordPiece · Dropout