High-Rate Quantized Matrix Multiplication I
Or Ordentlich, Yury Polyanskiy

TL;DR
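This paper studies quantized matrix multiplication (MatMul) for efficient large language model deployment. It focuses on the generic setting where both weights and activations are quantized without calibration data, and compares the high-rate information-theoretic limits with practical absmax INT and FP schemes. Weight-only quantization is left to Part II.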
Contribution
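It reviews the information-theoretic tradeoff between quantization rate and distortion for MatMul when no a priori (calibration) statistics are available, contrasts that tradeoff with popular absmax INT and floating-point (FP) schemes, and derives accurate heuristic approximations for the performance of those schemes.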
Findings
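High-rate theory characterizes the fundamental tradeoff between quantization rate and distortion for quantized MatMul.
Absmax INT and FP quantization schemes are contrasted with this tradeoff, and accurate heuristic approximations of their performance are derived.
Weight-only quantization, where second-order statistics of the activations are available at the encoder, is studied in Part II.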
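As a rough illustration of what a high-rate rate-distortion tradeoff looks like, the sketch below uses the classical scalar rule of thumb rather than the paper's own results: a uniform quantizer with step size delta has mean squared error close to delta^2 / 12, so each extra bit cuts the distortion by about a factor of 4. The function names, the uniform input distribution on [-1, 1], and the sample count are all assumptions made for this example.

```python
import random


def uniform_quantize(x, step):
    """Round x to the nearest multiple of step (uniform scalar quantizer)."""
    return step * round(x / step)


def empirical_mse(bits, n_samples=100_000, seed=0):
    """Return (measured MSE, step**2 / 12) for inputs uniform on [-1, 1].

    The step is chosen as 2 / 2**bits, i.e. the range [-1, 1] is split
    into 2**bits cells (an assumption made for this illustration).
    """
    rng = random.Random(seed)
    step = 2.0 / (2 ** bits)
    total = 0.0
    for _ in range(n_samples):
        x = rng.uniform(-1.0, 1.0)
        err = x - uniform_quantize(x, step)
        total += err * err
    return total / n_samples, step * step / 12.0


if __name__ == "__main__":
    for bits in (4, 6, 8):
        mse, predicted = empirical_mse(bits)
        print(f"bits={bits}  measured={mse:.3e}  step^2/12={predicted:.3e}")
```

The measured and predicted values should agree closely at every rate shown, which is the sense in which a high-rate approximation "characterizes" the tradeoff for this toy quantizer.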
Abstract
This paper investigates the problem of quantized matrix multiplication (MatMul), which has become crucial for the efficient deployment of large language models (LLMs). We consider a Generic MatMul setting, where both matrices must be quantized (weight+activation quantization) without specific a priori (calibration) statistical information about the factors. We review the fundamental information-theoretic tradeoff between quantization rate and distortion (high-rate theory), and contrast it with the performance of popular quantization schemes (absmax INT and floating-point (FP)), for which we also derive accurate heuristic approximations. Part II of this paper studies the weight-only quantization setup where second-order statistics of the activation matrices are available at the encoder.
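To make the absmax INT scheme named in the abstract concrete, here is a minimal sketch of weight+activation quantized MatMul in plain Python. It assumes one scale per row of the left factor and one per column of the right factor, with symmetric signed integers; the paper may use a different scaling granularity, and all function names here are hypothetical.

```python
import math
import random


def absmax_quantize(vec, bits=8):
    """Quantize vec to signed integers using its absolute maximum as the scale."""
    qmax = 2 ** (bits - 1) - 1            # e.g. 127 for 8 bits
    amax = max(abs(v) for v in vec)
    scale = amax / qmax if amax > 0 else 1.0
    return [round(v / scale) for v in vec], scale


def matmul(A, B):
    """Reference full-precision product of A (n x k) and B (k x m)."""
    cols = list(zip(*B))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in A]


def quantized_matmul(A, B, bits=8):
    """Quantize rows of A and columns of B, multiply integers, then rescale."""
    q_rows = [absmax_quantize(row, bits) for row in A]
    q_cols = [absmax_quantize(col, bits) for col in zip(*B)]
    return [
        [sa * sb * sum(x * y for x, y in zip(qr, qc)) for qc, sb in q_cols]
        for qr, sa in q_rows
    ]


def relative_error(C, C_hat):
    """Frobenius-norm error of C_hat relative to C."""
    num = sum((c - h) ** 2 for rc, rh in zip(C, C_hat) for c, h in zip(rc, rh))
    den = sum(c ** 2 for rc in C for c in rc)
    return math.sqrt(num / den)


def gaussian_matrix(rows, cols, rng):
    """Matrix with i.i.d. standard normal entries (test data for this sketch)."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


if __name__ == "__main__":
    rng = random.Random(0)
    A = gaussian_matrix(8, 32, rng)
    B = gaussian_matrix(32, 8, rng)
    exact = matmul(A, B)
    for bits in (4, 8):
        err = relative_error(exact, quantized_matmul(A, B, bits))
        print(f"bits={bits}  relative error={err:.4f}")
```

Only the integer inner products depend on the bit width; the two scales are applied once per output entry, which is what makes this kind of scheme attractive on integer hardware.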
