QUIK: Towards End-to-End 4-Bit Inference on Generative Large Language Models
Saleh Ashkboos, Ilia Markov, Elias Frantar, Tingxuan Zhong, Xincheng Wang, Jie Ren, Torsten Hoefler, Dan Alistarh

TL;DR
This paper introduces QUIK, a hybrid 4-bit quantization method for large language models that significantly reduces inference costs while maintaining accuracy, enabling faster and more efficient generative AI applications.
Contribution
QUIK is the first hybrid 4-bit quantization approach that effectively compresses weights and activations for large LLMs, achieving practical speedups and high accuracy.
Findings
Up to 3.4x throughput improvement over FP16
Effective quantization for models like LLaMA, OPT, Falcon
Successful inference with 2:4 sparsity and quantization
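As an illustration of the 2:4 sparsity pattern mentioned in the findings, the sketch below keeps the two largest-magnitude values in every group of four consecutive weights and zeroes the rest. It is a generic magnitude-based example, not the pruning procedure used in the paper; the function name `prune_2_of_4` and the sample weights are hypothetical.

```python
def prune_2_of_4(row):
    """Zero the two smallest-magnitude values in every group of four."""
    if len(row) % 4 != 0:
        raise ValueError("row length must be a multiple of 4")
    pruned = []
    for start in range(0, len(row), 4):
        group = row[start:start + 4]
        # Indices of the two largest-magnitude entries in this group.
        keep = sorted(range(4), key=lambda i: abs(group[i]), reverse=True)[:2]
        pruned.extend(v if i in keep else 0.0 for i, v in enumerate(group))
    return pruned


# Hypothetical weights: two groups of four values.
weights = [0.9, -0.1, 0.05, -0.7, 0.2, 0.3, -0.4, 0.01]
sparse = prune_2_of_4(weights)
print(sparse)
```

Every group of four in the result has at most two non-zero entries, which is the structure that 2:4 sparse hardware kernels expect.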
Abstract
Large Language Models (LLMs) from the GPT family have become extremely popular, leading to a race towards reducing their inference costs to allow for efficient local computation. Yet, the vast majority of existing work focuses on weight-only quantization, which can reduce runtime costs in the memory-bound one-token-at-a-time generative setting, but does not address them in compute-bound scenarios, such as batched inference or prompt processing. In this paper, we address the general quantization problem, where both weights and activations should be quantized. We show, for the first time, that the majority of inference computations for large generative models such as LLaMA, OPT, and Falcon can be performed with both weights and activations being cast to 4 bits, in a way that leads to practical speedups, while at the same time maintaining good accuracy. We achieve this via a hybrid…
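To make the idea of quantizing both weights and activations concrete, here is a minimal sketch of a matrix-vector product in which both operands are rounded to 4-bit integers before multiplication. It assumes plain symmetric quantization with one scale per weight row and one scale for the activation vector; this is a generic illustration, not the hybrid scheme the abstract refers to, and the helper names and sample values are hypothetical.

```python
def quantize_int4(values):
    """Symmetric quantization to integers in [-8, 7] with one shared scale."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 7.0 if max_abs > 0 else 1.0
    q = [max(-8, min(7, round(v / scale))) for v in values]
    return q, scale


def quantized_matvec(weight_rows, activations):
    """Multiply in integers, then rescale the result back to floating point."""
    qx, sx = quantize_int4(activations)
    out = []
    for row in weight_rows:
        qw, sw = quantize_int4(row)
        acc = sum(a * b for a, b in zip(qw, qx))  # integer accumulation
        out.append(acc * sw * sx)                 # dequantize with both scales
    return out


# Hypothetical 2x3 weight matrix and activation vector.
W = [[0.4, -1.0, 0.3], [1.4, 0.6, -0.4]]
x = [0.9, -2.0, 0.6]

approx = quantized_matvec(W, x)
exact = [sum(w * a for w, a in zip(row, x)) for row in W]
print("quantized:", approx)
print("exact:    ", exact)
```

The inner product runs entirely on small integers, which is what lets low-precision hardware units do the heavy arithmetic; only the final rescaling uses floating point. The gap between the two printed results is the quantization error.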
Taxonomy
Topics: Topic Modeling · Ferroelectric and Negative Capacitance Devices · Machine Learning and Algorithms
Methods: Multi-Head Attention · Attention Is All You Need · Cosine Annealing · Linear Layer · Softmax · Discriminative Fine-Tuning · Linear Warmup With Cosine Annealing · Dropout · Residual Connection
