High-Rate Quantized Matrix Multiplication II
Or Ordentlich, Yury Polyanskiy

TL;DR
This paper studies weight-only post-training quantization for matrix multiplication in large language models. It shows that waterfilling-based rate allocation comes close to the theoretical distortion limit at high rates, and that the WaterSIC scheme is basis free.
Contribution
The paper shows how waterfilling can improve weight-only post-training quantization algorithms such as GPTQ, and analyzes the WaterSIC scheme, whose high-rate performance is near-optimal and independent of the basis.
Findings
The WaterSIC scheme is basis free at high rates and close to the theoretical distortion limit.
GPTQ with random rotation performs close to WaterSIC, within 0.1 bit for Llama-3-8B.
Waterfilling improves practical LLM quantization algorithms such as GPTQ, which at present allocate rate equally.
Abstract
This is the second part of the work investigating quantized matrix multiplication (MatMul). In Part I we considered the case of calibration-free quantization, whereas here we discuss the setting where the covariance matrix of the columns of the second factor is available. This setting arises in the ubiquitous task of weight-only post-training quantization of LLMs. Weight-only quantization is related to the problem of weighted mean squared error (WMSE) source coding, whose classical (reverse) waterfilling solution dictates how one should distribute rate between coordinates of the vector. We show how waterfilling can be used to improve practical LLM quantization algorithms (GPTQ), which at present allocate rate equally. A recent scheme (known as "WaterSIC") that only uses scalar INT quantizers is analyzed and its high-rate performance is shown to be (a) basis free (i.e.,…
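The classical reverse waterfilling allocation mentioned in the abstract can be illustrated with a minimal sketch. It assumes independent Gaussian coordinates with known variances (for example, standing in for eigenvalues of a weighted covariance) and a total rate budget in bits. The function name `reverse_waterfilling` and the example numbers are hypothetical. This is the textbook rate-distortion allocation, not the paper's WaterSIC scheme or its GPTQ modification.

```python
import math

def reverse_waterfilling(variances, total_rate_bits, iters=200):
    """Textbook reverse waterfilling for independent Gaussian coordinates.

    Finds a water level theta such that
        sum_i max(0, 0.5 * log2(var_i / theta)) == total_rate_bits,
    then returns (theta, per-coordinate rates, per-coordinate distortions).
    """
    if total_rate_bits <= 0 or any(v <= 0 for v in variances):
        raise ValueError("need a positive rate budget and positive variances")

    def rate_at(theta):
        # Only coordinates whose variance exceeds the water level get rate.
        return sum(0.5 * math.log2(v / theta) for v in variances if v > theta)

    # rate_at is decreasing in theta, so bisect on the water level.
    lo, hi = 0.0, max(variances)
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if rate_at(mid) > total_rate_bits:
            lo = mid  # water level too low: spends more than the budget
        else:
            hi = mid
    theta = hi

    rates = [0.5 * math.log2(v / theta) if v > theta else 0.0 for v in variances]
    distortions = [min(v, theta) for v in variances]
    return theta, rates, distortions

# Hypothetical example: three coordinates, 2 bits in total.
theta, rates, distortions = reverse_waterfilling([4.0, 1.0, 0.25], 2.0)
# Equal-rate baseline for comparison: D_i = var_i * 2^(-2 * R / n).
equal_rate_distortion = sum(v * 2 ** (-2 * 2.0 / 3) for v in [4.0, 1.0, 0.25])
```

In this example the coordinate with the smallest variance falls below the water level and receives no rate, while the other two split the budget unevenly. An equal split, which the abstract says current algorithms use, yields a larger total distortion under the same Gaussian model.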
