High-Rate Quantized Matrix Multiplication II

Or Ordentlich; Yury Polyanskiy

arXiv:2605.13768 · cs.LG · May 14, 2026

TL;DR

This paper studies weight-only post-training quantization of matrix multiplication in large language models. It shows that waterfilling-based rate allocation brings practical schemes close to optimal performance, and it analyzes basis-free schemes.

Contribution

It introduces a waterfilling-based approach to improving weight-only post-training quantization and shows that it is near-optimal and basis independent in the high-rate regime.

Findings

01. The WaterSIC scheme is basis free and, at high rate, close to the theoretical distortion limit.

02. GPTQ with random rotation performs within 0.1 bit of WaterSIC on Llama-3-8B.

03. Waterfilling improves practical quantization algorithms for LLMs.

Abstract

This is the second part of the work investigating quantized matrix multiplication (MatMul). In part I we considered the case of calibration-free quantization, whereas here we discuss the setting where the covariance matrix $\Sigma_X$ of the columns of the second factor is available. This setting arises in the ubiquitous task of weight-only post-training quantization of LLMs. Weight-only quantization is related to the problem of weighted mean squared error (WMSE) source coding, whose classical (reverse) waterfilling solution dictates how one should distribute rate between the coordinates of the vector. We show how waterfilling can be used to improve practical LLM quantization algorithms (GPTQ), which at present allocate rate equally. A recent scheme (known as "WaterSIC") that only uses scalar INT quantizers is analyzed and its high-rate performance is shown to be (a) basis free (i.e.,…
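The classical reverse waterfilling rule mentioned in the abstract can be illustrated with a minimal sketch. It covers only the textbook case of independent Gaussian components under plain (unweighted) MSE, and it is not the paper's algorithm, WaterSIC, or GPTQ. The function name `reverse_waterfill`, its parameters, and the bisection search are assumptions made for illustration. Each component with variance $\sigma_i^2$ receives rate $\max(0, \tfrac{1}{2}\log_2(\sigma_i^2/\theta))$ bits and distortion $\min(\theta, \sigma_i^2)$, where the water level $\theta$ is chosen so that the rates sum to the budget.

```python
import math


def reverse_waterfill(variances, total_rate_bits, iters=200):
    """Textbook reverse waterfilling for independent Gaussian components.

    Hypothetical helper for illustration only. Returns a tuple
    (water_level, rates_in_bits, distortions).
    """

    def rates_at(theta):
        # Components with variance below the water level get zero rate.
        return [max(0.0, 0.5 * math.log2(v / theta)) for v in variances]

    # Total rate decreases as the water level rises, so bisect on theta.
    lo, hi = 0.0, max(variances)
    for _ in range(iters):
        theta = 0.5 * (lo + hi)
        if sum(rates_at(theta)) > total_rate_bits:
            lo = theta  # spending too many bits: raise the water level
        else:
            hi = theta  # within budget: try a lower water level
    theta = hi
    rates = rates_at(theta)
    distortions = [min(theta, v) for v in variances]
    return theta, rates, distortions


if __name__ == "__main__":
    theta, rates, dists = reverse_waterfill([4.0, 1.0, 0.25], total_rate_bits=3.0)
    print("water level:", theta)
    print("rates (bits):", rates)
    print("distortions:", dists)
```

With variances 4, 1, 0.25 and a 3-bit budget, the allocation is unequal: the high-variance component gets most of the bits, and the component whose variance equals the water level gets none. Allocating the same rate to every coordinate, as the abstract says GPTQ does at present, ignores this spread.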
