Optimal Formats for Weight Quantisation
Douglas Orr, Luka Ribar, Carlo Luschi

TL;DR
This paper introduces a systematic framework for designing weight quantisation formats that leverage variable-length coding, improving the compression and efficiency of deep learning models.
Contribution
It connects quantisation format design with classical theory, develops non-linear quantisation curves, and derives optimal bit-width allocation across model layers.
Findings
Variable-length formats outperform fixed-length formats.
Optimal bit-width allocation saves up to 0.25 bits per parameter.
Formats exploiting variable-length encoding improve model efficiency.
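As a rough illustration of what bit-width allocation across tensors involves, the sketch below applies the classical high-resolution allocation rule from quantisation theory, in which each tensor's error is modelled as proportional to sensitivity × variance × 2^(-2·bits). This is not necessarily the exact rule derived in the paper. The function names, the example tensor sizes and the sensitivity values (stand-ins for a per-tensor Fisher information estimate) are all hypothetical, and the sketch ignores the constraints that bit widths be non-negative or integer.

```python
import numpy as np

def allocate_bits(sizes, sensitivity, variance, avg_bits):
    """Classical high-resolution bit allocation (illustrative, not the paper's exact rule).

    Minimises sum_i n_i * s_i * v_i * 2^(-2 b_i) subject to a fixed
    average number of bits per parameter.
    """
    sizes = np.asarray(sizes, dtype=float)
    log_term = 0.5 * np.log2(np.asarray(sensitivity, dtype=float)
                             * np.asarray(variance, dtype=float))
    # Subtract the size-weighted mean so the average bit width is preserved.
    mean_log = np.sum(sizes * log_term) / np.sum(sizes)
    return avg_bits + log_term - mean_log

def weighted_error(sizes, sensitivity, variance, bits):
    """Modelled sensitivity-weighted squared error for a given allocation."""
    sizes = np.asarray(sizes, dtype=float)
    sensitivity = np.asarray(sensitivity, dtype=float)
    variance = np.asarray(variance, dtype=float)
    bits = np.asarray(bits, dtype=float)
    return float(np.sum(sizes * sensitivity * variance * 2.0 ** (-2.0 * bits)))

# Hypothetical model with three tensors (all values are assumptions).
sizes = [4e6, 16e6, 16e6]          # parameters per tensor
sensitivity = [8.0, 1.0, 0.25]     # stand-in for mean Fisher information
variance = [1e-3, 1e-3, 1e-3]      # weight variance per tensor
avg_bits = 4.0

bits = allocate_bits(sizes, sensitivity, variance, avg_bits)
err_opt = weighted_error(sizes, sensitivity, variance, bits)
err_flat = weighted_error(sizes, sensitivity, variance, np.full(3, avg_bits))

print("bits per tensor:", np.round(bits, 3))
print("modelled error, allocated vs flat:", err_opt, err_flat)
```

Under this error model, the more sensitive tensors receive more bits and the less sensitive ones fewer, while the average number of bits per parameter stays at the target.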
Abstract
Weight quantisation is an essential technique for enabling efficient training and deployment of modern deep learning models. However, the recipe book of quantisation formats is large and formats are often chosen empirically. In this paper, we propose a framework for systematic design and analysis of quantisation formats. By connecting the question of format design with the classical quantisation theory, we show that the strong practical performance of popular formats comes from their ability to represent values using variable-length codes. We frame the problem as minimising the KL divergence between original and quantised model outputs under a model size constraint, which can be approximated by minimising the squared quantisation error, a well-studied problem where entropy-constrained quantisers with variable-length codes are optimal. We develop non-linear quantisation curves for…
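The claim that variable-length codes help at a given size can be checked on a toy problem. The sketch below is a minimal illustration, not the paper's experimental setup. It draws standard Gaussian samples as a stand-in for weights and compares a fixed-length 3-bit quantiser fitted with Lloyd's algorithm against a uniform quantiser whose indices are assumed to be compressed by an ideal entropy coder. The rate of the second scheme is therefore the empirical entropy, which a practical code such as Huffman would slightly exceed. All function names are hypothetical.

```python
import numpy as np

def lloyd_max(x, n_levels, n_iter=50):
    """Fixed-length quantiser: 1-D Lloyd's algorithm with n_levels codepoints."""
    # Initialise codepoints at evenly spaced quantiles of the data.
    levels = np.quantile(x, (np.arange(n_levels) + 0.5) / n_levels)
    for _ in range(n_iter):
        edges = (levels[:-1] + levels[1:]) / 2      # nearest-neighbour boundaries
        idx = np.searchsorted(edges, x)
        sums = np.bincount(idx, weights=x, minlength=n_levels)
        counts = np.bincount(idx, minlength=n_levels)
        # Move each codepoint to the centroid of its cell (keep it if the cell is empty).
        levels = np.where(counts > 0, sums / np.maximum(counts, 1), levels)
    edges = (levels[:-1] + levels[1:]) / 2
    return levels[np.searchsorted(edges, x)]

def uniform_entropy(x, step):
    """Uniform quantiser; returns the reconstruction and the entropy of the indices in bits."""
    idx = np.round(x / step).astype(np.int64)
    _, counts = np.unique(idx, return_counts=True)
    p = counts / counts.sum()
    return idx * step, float(-(p * np.log2(p)).sum())

def step_for_rate(x, target_bits, lo=1e-3, hi=10.0, n_iter=40):
    """Bisect (in log space) for a step size whose index entropy is at most target_bits."""
    for _ in range(n_iter):
        mid = np.sqrt(lo * hi)
        _, bits = uniform_entropy(x, mid)
        if bits > target_bits:
            lo = mid      # too many bits: the step must be coarser
        else:
            hi = mid
    return hi

rng = np.random.default_rng(0)
x = rng.standard_normal(100_000)   # stand-in for a weight tensor

target_bits = 3.0
x_fixed = lloyd_max(x, n_levels=2 ** int(target_bits))
mse_fixed = float(np.mean((x - x_fixed) ** 2))

step = step_for_rate(x, target_bits)
x_var, bits_var = uniform_entropy(x, step)
mse_var = float(np.mean((x - x_var) ** 2))

print(f"fixed-length:    {target_bits:.2f} bits, MSE {mse_fixed:.4f}")
print(f"variable-length: {bits_var:.2f} bits, MSE {mse_var:.4f}")
```

At a matched rate of about 3 bits per value, the entropy-coded uniform quantiser reaches a lower squared error than the fixed-length one on this Gaussian example. This is consistent with the classical result the abstract refers to.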
Peer Reviews
Decision · Submitted to ICLR 2026
The paper provides a solid theoretical perspective by linking neural network quantization with classical information theory, offering useful insights into why certain formats perform well. It introduces new scaling schemes and a Fisher information based bit allocation rule that appear to improve efficiency across model tensors. Experiments on several large language models support the proposed framework and suggest its potential practical value.
I have the following concerns about the paper: 1. Equations (1) and (2) require stronger justification: minimizing the KL divergence does not necessarily guarantee that the model’s accuracy will be preserved, and the validity of Equation (2) needs a clearer theoretical explanation. 2. Additional background is needed to help readers follow the technical development. For instance, the sections around lines 143–146 and 153–157 would benefit from more context and introductory material. 3. Experi
- The paper establishes a principled theoretical framework for analyzing quantization formats by reducing the problem to Fisher-information-weighted squared quantization error, enabling the application of classical quantization theory to neural network weight compression. This is a significant, principled contribution that enables systematic format design rather than ad-hoc heuristics. - The power of this theoretical framework is demonstrated by the authors' ability to directly leverage the rich
- The authors only empirically investigate transformer LLMs, and even among these, Gemma models exhibit behavior that deviates from the theoretical predictions. This raises concerns about the generality of the framework. If discrepancies arise within transformer LLMs alone, it is unclear how well the insights would extend to other architectures such as CNNs, GNNs, or state-space models. - The theoretical framework relies on three approximations: second-order Taylor expansion of KL divergence, di
- The paper advances the formalization of quantization data format selection, which is often tackled only empirically; - The mathematical background seems solid and the SoTA seems adequately cited; - The supplementary material is rich and supports some assumptions and approximations made in the paper.
- I found the paper hard to read and follow. The overall structure could be improved. The figures are placed far from the text that discusses them. Some figures mentioned in the main text (Figures 8, 29, 33...) are missing and appear only in the supplementary material. - Overall, it seems to me that the main paper is not entirely self-contained without the supplementary material. - Actual quantization results on LLMs are completely absent from the paper and, again, can be found only in supplementa
Taxonomy
Topics: Advanced Data Compression Techniques · Digital Filter Design and Implementation
