QUIK: Towards End-to-End 4-Bit Inference on Generative Large Language Models

Saleh Ashkboos; Ilia Markov; Elias Frantar; Tingxuan Zhong; Xincheng Wang; Jie Ren; Torsten Hoefler; Dan Alistarh

arXiv:2310.09259·cs.LG·November 3, 2023·2 cites

TL;DR

This paper introduces QUIK, a hybrid 4-bit quantization method for large language models that quantizes both weights and activations, cutting inference cost (up to 3.4x higher throughput than FP16) while maintaining good accuracy.

Contribution

QUIK is the first hybrid 4-bit quantization approach that effectively compresses both weights and activations of LLMs, achieving practical speedups and high accuracy.

Findings

01. Up to 3.4x throughput improvement over FP16
02. Effective quantization for models such as LLaMA, OPT, and Falcon
03. Successful inference with 2:4 sparsity combined with quantization

Abstract

Large Language Models (LLMs) from the GPT family have become extremely popular, leading to a race towards reducing their inference costs to allow for efficient local computation. Yet, the vast majority of existing work focuses on weight-only quantization, which can reduce runtime costs in the memory-bound one-token-at-a-time generative setting, but does not address them in compute-bound scenarios, such as batched inference or prompt processing. In this paper, we address the general quantization problem, where both weights and activations should be quantized. We show, for the first time, that the majority of inference computations for large generative models such as LLaMA, OPT, and Falcon can be performed with both weights and activations being cast to 4 bits, in a way that leads to practical speedups, while at the same time maintaining good accuracy. We achieve this via a hybrid…
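To make the distinction in the abstract concrete, the sketch below quantizes both the weights and the activations of a single linear layer to 4 bits, accumulates in integers, and rescales the result. It is a generic round-to-nearest illustration, not QUIK's hybrid scheme, whose details are cut off in the excerpt above. The function names `quantize_int4` and `int4_matmul`, the symmetric per-row scaling, and the random test data are all assumptions made for this example.

```python
import random


def quantize_int4(row):
    """Symmetric 4-bit quantization of one vector.

    Returns integers in [-8, 7] and the float scale that maps them back.
    (Hypothetical helper for illustration only.)
    """
    max_abs = max(abs(v) for v in row)
    scale = max_abs / 7.0 if max_abs > 0 else 1.0
    q = [max(-8, min(7, round(v / scale))) for v in row]
    return q, scale


def int4_matmul(x, w):
    """Approximate x @ w^T with both operands quantized to 4 bits.

    x: list of token rows (activations).
    w: list of output-channel rows (weights), same inner length as x.
    """
    qx = [quantize_int4(row) for row in x]  # one scale per token
    qw = [quantize_int4(row) for row in w]  # one scale per output channel
    out = []
    for xi, sx in qx:
        out_row = []
        for wj, sw in qw:
            acc = sum(a * b for a, b in zip(xi, wj))  # integer accumulation
            out_row.append(acc * sx * sw)  # rescale back to float
        out.append(out_row)
    return out


# Random data standing in for real activations and weights (assumption).
random.seed(0)
x = [[random.gauss(0, 1) for _ in range(64)] for _ in range(4)]
w = [[random.gauss(0, 1) for _ in range(64)] for _ in range(8)]

# Full-precision reference and the 4-bit approximation.
y_ref = [[sum(a * b for a, b in zip(xi, wj)) for wj in w] for xi in x]
y_q = int4_matmul(x, w)

err = sum((p - r) ** 2 for pr, rr in zip(y_q, y_ref) for p, r in zip(pr, rr)) ** 0.5
ref = sum(r ** 2 for rr in y_ref for r in rr) ** 0.5
rel_err = err / ref
print(f"relative error: {rel_err:.3f}")
```

In a weight-only scheme, only `w` would be stored in 4 bits and the multiplication itself would still run in floating point. Quantizing `x` as well is what lets the inner product run on integer hardware, which is the compute-bound case the abstract targets.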

Code & Models

Repositories

ist-daslab/quik (PyTorch, official)

Taxonomy

Topics: Topic Modeling · Ferroelectric and Negative Capacitance Devices · Machine Learning and Algorithms

Methods: Multi-Head Attention · Attention Is All You Need · Cosine Annealing · Linear Layer · Softmax · Discriminative Fine-Tuning · Linear Warmup With Cosine Annealing · Dropout · Residual Connection