Revisiting Block-based Quantisation: What is Important for Sub-8-bit LLM Inference?

Cheng Zhang, Jianyi Cheng, Ilia Shumailov, George A. Constantinides, and Yiren Zhao

arXiv:2310.05079·cs.LG·March 15, 2024


TL;DR

This paper adapts block quantisation to LLMs for sub-8-bit inference, achieving nearly-lossless 4-bit models on downstream tasks without re-training by addressing numerical scaling offsets and distribution mismatches.

Contribution

It adapts block quantisation for LLMs, reducing scaling offsets and enabling nearly-lossless sub-8-bit quantisation without calibration or re-training.

Findings

01

6-bit LLMs achieve 19x arithmetic density and 5x memory density over float32

02

Surpass prior 8-bit quantisation by 2.5x in arithmetic density

03

Nearly-lossless 4-bit LLMs achieved on downstream tasks
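The memory-density finding can be sanity-checked with a back-of-envelope calculation. The sketch below assumes a block format in which each element stores a fixed number of bits and each block shares one extra field (for example a scaling exponent). The function name `memory_density`, the block size of 16, and the 8-bit shared field are illustrative assumptions, not values stated on this page.

```python
def memory_density(element_bits, shared_bits, block_size, baseline_bits=32):
    """Ratio of baseline bits per value to average bits per value in a block format.

    Assumption: every block of `block_size` elements stores `element_bits` per
    element plus one shared field of `shared_bits` (e.g. a shared exponent).
    """
    avg_bits = element_bits + shared_bits / block_size
    return baseline_bits / avg_bits


# Hypothetical 6-bit configuration: 16 elements share one 8-bit field,
# so each element costs 6 + 8/16 = 6.5 bits on average.
print(round(memory_density(6, 8, 16), 2))  # 4.92
```

Under these assumed parameters the result is close to the reported 5x, which shows how the cost of the shared field is amortised across the block.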

Abstract

The inference of large language models (LLMs) requires immense computation and memory resources. To curtail these costs, quantisation has emerged as a promising solution, but existing LLM quantisation mainly focuses on 8-bit. In this work, we explore the statistical and learning properties of the LLM layer and attribute the bottleneck of LLM quantisation to numerical scaling offsets. To address this, we adapt block quantisations for LLMs, a family of methods that share scaling factors across packed numbers. Block quantisations efficiently reduce the numerical scaling offsets solely from an arithmetic perspective, without additional treatments in the computational path. Our nearly-lossless quantised 6-bit LLMs achieve a $19 \times$ higher arithmetic density and $5 \times$ memory density than the float32 baseline, surpassing the prior art 8-bit quantisation by $2.5 \times$ in arithmetic…
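To make "sharing scaling factors across packed numbers" concrete, here is a minimal NumPy sketch of one possible block scheme: each block of values shares a single power-of-two scale, and each element is rounded to a small signed integer. It is an illustration under stated assumptions, not the authors' implementation; the function name `block_quantise`, the default block size, and the choice of a power-of-two scale are all hypothetical.

```python
import numpy as np


def block_quantise(x, block_size=16, bits=6):
    """Fake-quantise a 1-D array using one shared power-of-two scale per block.

    Assumptions (illustrative only): symmetric signed integers of `bits` bits,
    and a shared scale chosen from the largest magnitude in each block.
    """
    x = np.asarray(x, dtype=np.float64)
    pad = (-x.size) % block_size                 # zero-pad to a whole number of blocks
    blocks = np.pad(x, (0, pad)).reshape(-1, block_size)

    qmax = 2 ** (bits - 1) - 1                   # largest representable integer magnitude
    max_abs = np.max(np.abs(blocks), axis=1, keepdims=True)
    safe = np.where(max_abs > 0, max_abs, 1.0)   # avoid log2(0) for all-zero blocks

    # Smallest power-of-two scale such that max_abs / scale <= qmax.
    scale = 2.0 ** np.ceil(np.log2(safe / qmax))

    q = np.clip(np.round(blocks / scale), -qmax, qmax)
    return (q * scale).reshape(-1)[: x.size]     # dequantise and drop the padding


# Two groups of values with very different magnitudes.
x = np.concatenate([np.full(16, 0.01), np.full(16, 100.0)])
per_block = block_quantise(x, block_size=16)     # each group gets its own scale
per_tensor = block_quantise(x, block_size=32)    # one scale for everything
print(np.max(np.abs(per_block - x)), np.max(np.abs(per_tensor - x)))
```

With one scale for the whole array, the small values are rounded to zero because the scale is set by the large ones. With a scale per block, each group is represented on its own range, which is the sense in which a shared per-block scale reduces scaling offsets.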


Code & Models

Repositories

chengzhang-98/llm-mixed-q
PyTorch · Official


Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Ferroelectric and Negative Capacitance Devices