BAQ: Efficient Bit Allocation Quantization for Large Language Models
Chao Zhang, Li Wang, Samson Lasaulce, and Merouane Debbah

TL;DR
This paper introduces BAQ, a novel bit allocation quantization method for large language models that optimally assigns bitwidths based on sensitivity, significantly reducing perplexity compared to existing methods.
Contribution
The paper presents a convex optimization-based framework for adaptive bitwidth allocation in quantization, with a closed-form solution and an efficient algorithm called BAQ.
Findings
BAQ achieves up to 56× lower perplexity than GPTQ.
The method effectively balances loss minimization and computational complexity.
Experimental validation on models from 125M to 30B parameters demonstrates broad applicability.
Abstract
Post-training model quantization is a widely adopted technique for reducing the memory and computational costs of large language models (LLMs). However, most existing methods rely on uniform or heuristic bitwidth assignments, failing to account for the nonuniform sensitivity of weights to quantization noise. In this paper, we propose a novel framework for allocating quantization bitwidths based on sensitivity metrics derived from a Hessian proxy. We make key assumptions, which allow the layer/component-wise loss function to be expressed as an explicit function of the bitwidths. This enables a neat formulation of the bit allocation problem as a convex optimization task, whose closed-form solution adapts precision across weights to minimize the layer-wise quantization loss. Inspecting the solution provides several insights (such as the equal-loss structure), which are then exploited to…
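The closed-form solution and the equal-loss structure mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not give the loss expression, so the code below assumes the classical high-resolution model in which item i contributes a loss of c_i * 2^(-2 * b_i), where c_i is a sensitivity coefficient and b_i a real-valued bitwidth, and minimizes the total loss under a fixed average bitwidth. This is the textbook bit allocation result, not necessarily the paper's exact formulation. The function names `allocate_bits` and `per_item_loss` and the sensitivity values are hypothetical, and rounding or clipping to integer bitwidths is ignored.

```python
import numpy as np

def allocate_bits(sensitivity, avg_bits):
    """Closed-form minimizer of sum_i c_i * 2**(-2*b_i) subject to mean(b) == avg_bits.

    Assumed loss model (not taken from the paper); bitwidths are real-valued
    and unclipped, so they may be fractional or even negative.
    """
    c = np.asarray(sensitivity, dtype=float)
    log_c = np.log2(c)
    # Lagrangian stationarity gives b_i = avg_bits + 0.5 * (log2 c_i - mean_j log2 c_j):
    # more sensitive items get more bits, less sensitive items get fewer.
    return avg_bits + 0.5 * (log_c - log_c.mean())

def per_item_loss(sensitivity, bits):
    """Per-item loss c_i * 2**(-2*b_i) under the assumed model."""
    c = np.asarray(sensitivity, dtype=float)
    b = np.asarray(bits, dtype=float)
    return c * 2.0 ** (-2.0 * b)

# Hypothetical sensitivities spanning three orders of magnitude.
sens = np.array([0.01, 0.1, 1.0, 10.0])
bits = allocate_bits(sens, avg_bits=3.0)
losses = per_item_loss(sens, bits)
uniform_losses = per_item_loss(sens, np.full_like(sens, 3.0))

print("bits:", bits)
print("per-item loss (adaptive):", losses)
print("total loss adaptive vs uniform:", losses.sum(), uniform_losses.sum())
```

Under this assumed model every item ends up with the same loss at the optimum, which is one way to read the "equal-loss structure" in the abstract, and the gain over uniform allocation grows with the spread of the sensitivities.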
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
+ The new formulation could be of interest.
- Fine-grained, ungrouped mixed-precision quantization is not hardware-acceleration friendly, which limits its practical value; that value is not convincingly established empirically here either.
- The paper is well written, well organized, and easy to follow
- The theoretical formulation to derive the optimal bitwidth allocation as a function of layer loss and sensitivity is elegant and well grounded
- The investigation encompasses large scale LLMs with 7-30B parameters, highly relevant to practical workloads nowadays
- An interesting link is drawn between dispersion of the sensitivity coefficients and BAQ effectiveness
- The authors share their code and analytical derivations
- A key limitation is that, contrary to what is stated, the technique is very _unfriendly_ to hardware as it requires assignment of independent bitwidths to each matrix column. This is not supported in today's accelerators. Running kernels where each column uses a different bitwidth would require custom per-column packing/unpacking logic, adding latency and potentially memory overhead
- In the 3-bit regime, perplexity/accuracy results are comparable to GPTQ. At 2 bits, although BAQ improves sig
The method provides a theoretically grounded formulation of mixed-precision quantization through the equal-loss principle and derives an analytical optimality condition with a closed-form solution rather than relying on heuristics, making it a novel and well-established approach to bit allocation in LLM quantization.
- Although bits are allocated at the column level, it is unclear how this design leads to real hardware acceleration; the approach appears focused mainly on memory footprint reduction rather than compute efficiency. Hardware constraints and deployment aspects are not discussed, leaving practical feasibility uncertain.
- The comparative analysis is limited and lacks sufficient breadth. In particular, the paper omits comparisons with recent mixed-precision quantization methods that optimize under
The paper is well-written and easy to follow. The bit allocation formulation seems to be correct, and the corresponding solution (mathematical derivation) is correct.
1. The major concern is that the reported performance is not good. From the results in Table 1, I could observe the improvement when GPTQ is combined with BAQ. However, the final perplexity scores are not good even for large models (e.g., Llama2-7B, Llama2-13B), when compared to the recent quantization methods such as aespa [1], AutoRound [2], BoA [3], and GPTAQ [4]. Please compare the performance with these recent baselines, and also integrate BAQ into these methods to show the validity of the
Taxonomy
Topics: Natural Language Processing Techniques · Speech Recognition and Synthesis · Advanced Neural Network Applications
