High-Rate Quantized Matrix Multiplication I

Or Ordentlich; Yury Polyanskiy

arXiv:2601.17187·cs.IT·May 14, 2026


TL;DR

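This paper studies quantized matrix multiplication (MatMul) for efficient large language model deployment. It reviews the information-theoretic tradeoff between quantization rate and distortion in the high-rate regime for the generic setting where both matrices are quantized, and compares that tradeoff with absmax INT and floating-point schemes. Weight-only quantization is deferred to Part II.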

Contribution

It reviews the fundamental tradeoff between quantization rate and distortion for MatMul when no calibration statistics are available, compares popular schemes (absmax INT and FP) against that tradeoff, and derives accurate heuristic approximations of their performance.

Findings

01

High-rate theory characterizes the fundamental tradeoff between quantization rate and distortion.

02

Absmax INT and FP quantization schemes are compared against this fundamental tradeoff, and accurate heuristic approximations of their performance are derived.

03

Part II studies weight-only quantization, where second-order statistics of the activation matrices are available at the encoder.

Abstract

This paper investigates the problem of quantized matrix multiplication (MatMul), which has become crucial for the efficient deployment of large language models (LLMs). We consider a Generic MatMul setting, where both matrices must be quantized (weight+activation quantization) without specific a priori (calibration) statistical information about the factors. We review the fundamental information-theoretic tradeoff between quantization rate and distortion (high-rate theory), and contrast it with the performance of popular quantization schemes (absmax INT and floating-point (FP)), for which we also derive accurate heuristic approximations. Part II of this paper studies the weight-only quantization setup, where second-order statistics of the activation matrices are available at the encoder.
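To make the weight+activation setting concrete, the following is a minimal sketch of one common form of absmax INT quantization, written in pure Python so it runs without dependencies. It is an illustration under stated assumptions, not necessarily the exact variant analyzed in the paper: each row of the left matrix and each column of the right matrix gets its own absmax scale, and the function names (`absmax_quantize`, `quantized_matmul`, `relative_error`) are hypothetical. No calibration data is used; the scales depend only on the matrices being multiplied.

```python
import random


def absmax_quantize(vec, bits):
    """Map a vector to signed integers in [-qmax, qmax] using absmax scaling."""
    qmax = 2 ** (bits - 1) - 1                 # 127 for 8 bits, 7 for 4 bits
    amax = max(abs(x) for x in vec)
    scale = amax / qmax if amax > 0 else 1.0   # all-zero vectors keep scale 1
    return [round(x / scale) for x in vec], scale


def matmul(A, B):
    """Exact product of two matrices stored as lists of rows."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]


def quantized_matmul(A, B, bits):
    """Approximate A @ B from quantized factors only.

    Assumption of this sketch: one scale per row of A and per column of B.
    """
    qa = [absmax_quantize(row, bits) for row in A]
    qb = [absmax_quantize(list(col), bits) for col in zip(*B)]
    out = []
    for a_int, a_scale in qa:
        out_row = []
        for b_int, b_scale in qb:
            acc = sum(x * y for x, y in zip(a_int, b_int))  # integer inner product
            out_row.append(acc * a_scale * b_scale)         # rescale once per entry
        out.append(out_row)
    return out


def relative_error(exact, approx):
    """Squared Frobenius error of approx, normalized by the energy of exact."""
    num = sum((e - a) ** 2 for re, ra in zip(exact, approx) for e, a in zip(re, ra))
    den = sum(e ** 2 for re in exact for e in re)
    return num / den


if __name__ == "__main__":
    rng = random.Random(0)
    A = [[rng.gauss(0.0, 1.0) for _ in range(32)] for _ in range(8)]
    B = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(32)]
    exact = matmul(A, B)
    for bits in (4, 8):
        err = relative_error(exact, quantized_matmul(A, B, bits))
        print(f"{bits}-bit absmax INT: relative error {err:.3e}")
```

Running the script prints the relative distortion of the product at 4 and 8 bits per entry, which gives a rough empirical view of how distortion falls as the rate grows for this particular scheme.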
